Commits
We add a bit more complexity to addScore but avoid allocating a string if debugScore = false.
Test plan:
I ran a couple of benchmarks and confirmed we don't allocate.
Relates to #667.
This updates the expected scores, now that we have removed the repetition boost.
Tiny change to display the atom count. I caught myself several times trying to
translate the score to an atom count.
Test plan:
Ran it locally
I don't remember if there was a good reason to score methods and
functions differently, but I have found examples where the difference
between the two scores causes the better result to be lower ranked,
although it has better file-order boost.
Scip-ctags distinguishes between methods and functions, universal-ctags
does not.
Test plan:
updated score tests
The definition of how this is applied is very narrow and more often than
not works poorly. Originally I sent out a commit to dampen this using
log, but Julie suggested just removing which sounds better.
Test Plan: go test
Because "type" in scip-ctags matches so much we can't boost it as high
as we do for classes/structs in our language specific boosting. In the
case of GraphQL this is useful to boost.
Test Plan: A search like "type User" in the sourcegraph codebase now
includes results from the schema.graphql files.
Very useful for quick iteration on ranking work.
Test Plan: go run ./cmd/zoekt -debug foo
* temporarily use log branch
* logging: remove description param in Scoped
* use latest log and mountinfo versions
There are likely more that we are missing. We only briefly looked at
golang, likely other languages are also affected. The regression is
scip-ctags returning different values for kind.
Will follow-up with a more comprehensive workaround until scip-ctags
issues are resolved.
Test Plan: manually tested queries like "test server" against the golang
repo until the test server struct was the top result.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This field is no longer requested in Sourcegraph and has been deprecated
for a while.
Test Plan: go test ./...
Previously we enforced the binary was called universal-ctags. However,
we let users override the name of the binary, so if they override it we
should use it. To prevent footguns, we now validate the binary was built
with the interactive feature.
In the case of scip-ctags we use the old validation of just checking the
name.
Additionally we fix a bug that was introduced where if symbols are
optional we continue if parsing fails.
Test Plan: indexed with and without universal-ctags. Ctags on my mbp
points to something from xcode which wouldn't work
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:12 indexGitRepo(/Users/keegan/src/github.com/sourcegraph/zoekt, delta=false): build.NewBuilder: ctags.NewParserMap: ctags binary is not universal-ctags or is not compiled with +interactive feature: bin=ctags
exit status 1
$ CTAGS_COMMAND=universal-ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:29 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8657338 index bytes (overhead 2.9)
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index .
2023/10/04 17:08:40 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8538246 index bytes (overhead 2.9)
Right now we use panicf which leads a stack trace which is misleading at
what is happening and fills up the space used by kubernetes error
reporting. Additionally a few times we have had bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error.
Then ran the following
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3
This probably caused our tmp-dirs to run full. In production we saw data
from years ago taking up disk space.
Co-authored-by: Petri-Johan Last <petri.last@sourcegraph.com>
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
This was a nice to have when we investigated search times. A lot of the
stats we have don't really capture slow downs due to IO/etc, so a rough
time based metric will be useful to understand how much time we spend
actually searching vs deciding if we should search.
Test Plan: go test ./...
If the doc limit was triggered, the match limit was never evaluated,
even if it could have further reduced the FileMatches.
This moves all decisions about if we should do display limits and how to
enforce them into one place. By default we make it stateful so it can be
used by the limitSender. But in every other place we will always sort
and then truncate, so we also introduce a convenience API for that.
The only potential behaviour change here is if you have no display
limits but had document ranks turned on, we would not sort matches
within a shard. However, the aggregator would still do sorting anyways
so this is really just a wasted sort. In practice a display limit should
always be set.
Test Plan: go test
All clients of zoekt have a shared problem: they have no reliable way to
bound the size of the SearchResult. The primary dimension that
determines the size of a SearchResult is the number of matches. None of
the existing levers zoekt provides sufficiently limit this size:
- MaxDocDisplayCount is a hard limit on the number of Files in the
SearchResult. But when a single File can have an arbitrary number of
matches for the query, you can still end up with enormous
SearchResults when this parameter is 1.
The existing *MaxMatchCount parameters are more about limiting the
amount of work zoekt does when executing queries than they are about
limiting the response size:
- TotalMaxMatchCount is a soft limit on the number of matches
across shards. But it is only evaluated after handling each shard, so
if a single shard has an enormous number of matches, the SearchResult
will be enormous.
- ShardMaxMatchCount is a soft limit on the number of matches from a
single shard. But it is only evaluated after handling each document, so
if a single document has an enormous number of matches, the
SearchResult will be enormous.
- ShardRepoMaxMatchCount, well, you get the idea.
Different clients have a differing ability to tolerate enormous
SearchResults. Sourcegraph, for example, is apparently doing just fine;
they put hard limits on the number of matches in their own server, which
is itself a client of zoekt. They're presumably able to tolerate large
responses from zoekt as it's running colocated in a datacenter
environment.
But clients that are, for example, running in browsers, and using the
less-compact JSON-encoded API, are much less able to cope with enormous
SearchResults, which can be multiple megabytes large even with the most
conservative applications of the existing parameters.
Enter MaxMatchDisplayCount, which has similar semantics to
MaxDocDisplayCount, and is used by zoekt in the exact same places as
that parameter. With this, clients can get a much better handle on the
size of zoekt SearchResults.
In #187 we collapsed the spans which caused all the LazyPrintf in defer
statements to be dropped.
In this PR I revert #187, deduplicate trace logs (EG we logged "num
files" and `opts` in multiple places) and log `stats` in streaming
search.
Test plan:
I confirmed locally that the logs show up as expected.
* Create buf-breaking-check.yml
buf breaking test for grpc proto file to prevent checking in breaking changes and maintaining backward compatibility
Follow up to: https://github.com/sourcegraph/sourcegraph/pull/55131
* Update buf-breaking-check.yml
add workflow_dispatch: {}
* Update .github/workflows/buf-breaking-check.yml
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
This reverts commit b7e5070bfb563ea836e532dc524a733dd5d684c8. Initial
data from production shows we didn't improve performance so we are
reverting since the complicates without improving perf.
Test Plan: CI
The purpose of this commit is to reduce disk IO in case we skip a shard
because of missing ngrams.
To achieve this, we first check whether ALL ngrams exist in the shard before
loading the posting lists to determine their frequency. This means we have to
loop twice over the ngrams for the benefit of not loading any posting list in
case the shard would have been skipped anyways.
Test plan: This is a refactor, so relying on CI
The name always bothered me since we had fileNameNgrams as well. Now
they are both accessed via a btreeIndex, I also took the opportunity to
introduce a helper "ngrams" which returns the correct index depending on
if you want filenames or contents.
Test Plan: go test
This feature has been disabled by default for a long time.
Test Plan: CI
This step currently generates incomplete PRs since it hasn't been
updated to handle bazel yet. I'm unsure how to make it do that, so I'd
rather just remove it for now and add it back once we want to do that.
Test Plan: CI
Reintroducing this optimization which we lost when we sorted ngrams.
We believe this will improve performance of the btree lookups. We are
investigating this to make it faster to rule out a shard (when freq==0).
Testing locally on a large corpus we halved the time spent in IO.
Locally Sort shows up in the profiles significantly, but there are two
facts mitigating that:
- Locally my file page cache is primed so IO rarely is going to disk.
- We likely will implement an IR for Zoekt which will amortize the Sort
to once per search rather than once per shard.
Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
We have been running btree as the default for many months. We worried
about a performance hit, but it never happened. After some recent local
testing I did I noticed the btree actually interacted with the disk more
efficiently. So the old code both uses more memory and is slower, lets
just remove it.
Test Plan: go test ./...
Useful in local testing to capture both on and off cpu time spent.
Should consider shipping this in the webserver as well.
Test Plan: ran zoekt with -full_profile flag and inspected output in
pprof.
Test Plan: go test ./...
When the user specifies an invalid mirror config that contains an entry
without any of the GithubUser / GithubOrg / ... repository definitions,
zoekt-indexserver would crash, because `cmd` stays `nil` and causes a
SIGSEGV when it is later accessed in `main.executeMirror`.
Noticed we didn't include these fields in our prometheus and logging
integration.
Test Plan: go test ./...
The variable is `Version`, not `Branch`, and if we're using the template
variable syntax, it should have the leading dot that it needs to.
This instruments our matchiter code to track how often is calls Get on
the underlying ngram index. We have some queries on long strings which
are going much slower than expected and we suspect it is related to the
work done here.
This change requires a few moving pieces:
- Adding the int to Stats
- Instrumenting places we log/emit Stats
- grpc support
- capturing this work in matchiter code
- updating tests to ensure correctness
The last two points are the meat of this commit, while the rest was
mechanical work. I reworked the TestAndSearch since into TestSearchStats
since it was the only test which had useful assertions on Stats.
I was surprised to see how high the ngram lookup values were for such
simple queries. This makes me more confident in improving our
performance here.
Test Plan: go test
We now call updateStats upto twice per shard search. The intention of
this is to capture statistics before pruning the matchtree. Previously
we would of done work in creating a matchtree but would then prune those
items away and would then never capture those statistics.
In practice that work was reading just one or two varint (the size of a
posting list) so likely had minimal impact on the reported statistics.
However, in the next commit we want to introduce a statistic which is
recorded even if we generate a noMatchTree.
The main technical part of this commit is ensuring all existing
updateStats functions can be called twice without overcounting.
Test Plan: go test
Instead of checking stdout, we check the combined output. By default, git fetch writes to stderr.
Previously this was only populated by the web package for serving a
specific endpoint. We now populate it in all places we do aggregation.
Test Plan: updated unit tests. Additionally loaded zoekt-webserver then
added to the index a repository which required 2 shards to index. The
repository count shown only increased by 1.
I don't remember if there was a good reason to score methods and
functions differently, but I have found examples where the difference
between the two scores causes the better result to be lower ranked,
although it has better file-order boost.
Scip-ctags distinguishes between methods and functions, universal-ctags
does not.
Test plan:
updated score tests
There are likely more that we are missing. We only briefly looked at
golang, likely other languages are also affected. The regression is
scip-ctags returning different values for kind.
Will follow-up with a more comprehensive workaround until scip-ctags
issues are resolved.
Test Plan: manually tested queries like "test server" against the golang
repo until the test server struct was the top result.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously we enforced the binary was called universal-ctags. However,
we let users override the name of the binary, so if they override it we
should use it. To prevent footguns, we now validate the binary was built
with the interactive feature.
In the case of scip-ctags we use the old validation of just checking the
name.
Additionally we fix a bug that was introduced where if symbols are
optional we continue if parsing fails.
Test Plan: indexed with and without universal-ctags. Ctags on my mbp
points to something from xcode which wouldn't work
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:12 indexGitRepo(/Users/keegan/src/github.com/sourcegraph/zoekt, delta=false): build.NewBuilder: ctags.NewParserMap: ctags binary is not universal-ctags or is not compiled with +interactive feature: bin=ctags
exit status 1
$ CTAGS_COMMAND=universal-ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:29 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8657338 index bytes (overhead 2.9)
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index .
2023/10/04 17:08:40 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8538246 index bytes (overhead 2.9)
Right now we use panicf which leads a stack trace which is misleading at
what is happening and fills up the space used by kubernetes error
reporting. Additionally a few times we have had bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error.
Then ran the following
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3
This moves all decisions about if we should do display limits and how to
enforce them into one place. By default we make it stateful so it can be
used by the limitSender. But in every other place we will always sort
and then truncate, so we also introduce a convenience API for that.
The only potential behaviour change here is if you have no display
limits but had document ranks turned on, we would not sort matches
within a shard. However, the aggregator would still do sorting anyways
so this is really just a wasted sort. In practice a display limit should
always be set.
Test Plan: go test
All clients of zoekt have a shared problem: they have no reliable way to
bound the size of the SearchResult. The primary dimension that
determines the size of a SearchResult is the number of matches. None of
the existing levers zoekt provides sufficiently limit this size:
- MaxDocDisplayCount is a hard limit on the number of Files in the
SearchResult. But when a single File can have an arbitrary number of
matches for the query, you can still end up with enormous
SearchResults when this parameter is 1.
The existing *MaxMatchCount parameters are more about limiting the
amount of work zoekt does when executing queries than they are about
limiting the response size:
- TotalMaxMatchCount is a soft limit on the number of matches
across shards. But it is only evaluated after handling each shard, so
if a single shard has an enormous number of matches, the SearchResult
will be enormous.
- ShardMaxMatchCount is a soft limit on the number of matches from a
single shard. But it is only evaluated after handling each document, so
if a single document has an enormous number of matches, the
SearchResult will be enormous.
- ShardRepoMaxMatchCount, well, you get the idea.
Different clients have a differing ability to tolerate enormous
SearchResults. Sourcegraph, for example, is apparently doing just fine;
they put hard limits on the number of matches in their own server, which
is itself a client of zoekt. They're presumably able to tolerate large
responses from zoekt as it's running colocated in a datacenter
environment.
But clients that are, for example, running in browsers, and using the
less-compact JSON-encoded API, are much less able to cope with enormous
SearchResults, which can be multiple megabytes large even with the most
conservative applications of the existing parameters.
Enter MaxMatchDisplayCount, which has similar semantics to
MaxDocDisplayCount, and is used by zoekt in the exact same places as
that parameter. With this, clients can get a much better handle on the
size of zoekt SearchResults.
In #187 we collapsed the spans which caused all the LazyPrintf in defer
statements to be dropped.
In this PR I revert #187, deduplicate trace logs (EG we logged "num
files" and `opts` in multiple places) and log `stats` in streaming
search.
Test plan:
I confirmed locally that the logs show up as expected.
* Create buf-breaking-check.yml
buf breaking test for grpc proto file to prevent checking in breaking changes and maintaining backward compatibility
Follow up to: https://github.com/sourcegraph/sourcegraph/pull/55131
* Update buf-breaking-check.yml
add workflow_dispatch: {}
* Update .github/workflows/buf-breaking-check.yml
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
The purpose of this commit is to reduce disk IO in case we skip a shard
because of missing ngrams.
To achieve this, we first check whether ALL ngrams exist in the shard before
loading the posting lists to determine their frequency. This means we have to
loop twice over the ngrams for the benefit of not loading any posting list in
case the shard would have been skipped anyways.
Test plan: This is a refactor, so relying on CI
We believe this will improve performance of the btree lookups. We are
investigating this to make it faster to rule out a shard (when freq==0).
Testing locally on a large corpus we halved the time spent in IO.
Locally Sort shows up in the profiles significantly, but there are two
facts mitigating that:
- Locally my file page cache is primed so IO rarely is going to disk.
- We likely will implement an IR for Zoekt which will amortize the Sort
to once per search rather than once per shard.
Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
We have been running btree as the default for many months. We worried
about a performance hit, but it never happened. After some recent local
testing I did I noticed the btree actually interacted with the disk more
efficiently. So the old code both uses more memory and is slower, lets
just remove it.
Test Plan: go test ./...
This instruments our matchiter code to track how often is calls Get on
the underlying ngram index. We have some queries on long strings which
are going much slower than expected and we suspect it is related to the
work done here.
This change requires a few moving pieces:
- Adding the int to Stats
- Instrumenting places we log/emit Stats
- grpc support
- capturing this work in matchiter code
- updating tests to ensure correctness
The last two points are the meat of this commit, while the rest was
mechanical work. I reworked the TestAndSearch since into TestSearchStats
since it was the only test which had useful assertions on Stats.
I was surprised to see how high the ngram lookup values were for such
simple queries. This makes me more confident in improving our
performance here.
Test Plan: go test
We now call updateStats upto twice per shard search. The intention of
this is to capture statistics before pruning the matchtree. Previously
we would of done work in creating a matchtree but would then prune those
items away and would then never capture those statistics.
In practice that work was reading just one or two varint (the size of a
posting list) so likely had minimal impact on the reported statistics.
However, in the next commit we want to introduce a statistic which is
recorded even if we generate a noMatchTree.
The main technical part of this commit is ensuring all existing
updateStats functions can be called twice without overcounting.
Test Plan: go test
Previously this was only populated by the web package for serving a
specific endpoint. We now populate it in all places we do aggregation.
Test Plan: updated unit tests. Additionally loaded zoekt-webserver then
added to the index a repository which required 2 shards to index. The
repository count shown only increased by 1.