Commits
Firstly we use bytes.IndexByte for faster newLinesIndices. On my machine this
reduces wall clock time of BenchmarkTagsToSections by 38%. This is faster
since bytes.IndexByte relies on CPU specific optimizations to find the next
new line (eg uses AVX2 if available).
Secondly we reuse nls slice between calls to tagsToSections. I noticed in the
profiler a nonsignificant chunk in the garbage collector. The slice built by
newLinesIndices is allocated and thrown away for each call to tagsToSections.
This means we can re-use it which this commit implements by introducing a
struct storing the buffer. We now use this buffer per shard of symbols we
analyse.
old time/op new time/op delta
188µs ± 7% 101µs ± 3% -46.10% (p=0.000 n=10+10)
old alloc/op new alloc/op delta
79.3kB ± 0% 36.3kB ± 0% -54.24% (p=0.000 n=9+10)
old allocs/op new allocs/op delta
443 ± 0% 441 ± 0% -0.45% (p=0.000 n=10+10)
Test Plan: go test -bench BenchmarkTagsToSections
I found the code a bit hard to read before, this I believe is more
clear. I was also hoping for an improvement in the benchmarks, but the
improvement was statistically insignificant.
Test Plan: go test
This change adds a benchmark for the conversion from ctags output to Zoekt
document data, plus a tiny optimization to presize the symbol slices.
This adds a monitor which will report every minute the progress of
symbol analysis. Additionally, if a document is taking too long to
analyse (10s) we report it.
At first this is just reporting via stdlog. However, once we are
comfortable with thresholds around this we can likely also include a way
to kill analysis for a file.
Test Plan: Adjusted monitorReportStatus to 1s then indexed the
sourcegraph repo and inspected the output
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:03:10 attempting to index 14533 total files
2023/11/03 16:03:13 DEBUG: symbol analysis still running for shard statistics: duration=1s symbols=15805 bytes=44288971
2023/11/03 16:03:14 DEBUG: symbol analysis still running for shard statistics: duration=2s symbols=26189 bytes=51564417
2023/11/03 16:03:15 DEBUG: symbol analysis still running for shard statistics: duration=3s symbols=55613 bytes=64748084
2023/11/03 16:03:16 DEBUG: symbol analysis still running for shard statistics: duration=4s symbols=86557 bytes=93771404
2023/11/03 16:03:17 DEBUG: symbol analysis still running for shard statistics: duration=5s symbols=125352 bytes=116319453
2023/11/03 16:03:18 symbol analysis finished for shard statistics: duration=5s symbols=142951 bytes=129180023
2023/11/03 16:03:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983298 index bytes (overhead 2.8), 14533 files processed
I then added a random sleep for a minute in a file to see the stuck
reporting:
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:14:57 attempting to index 14533 total files
2023/11/03 16:15:15 WARN: symbol analysis for README.md (3485 bytes) has been running for 14s
2023/11/03 16:15:25 WARN: symbol analysis for README.md (3485 bytes) has been running for 24s
2023/11/03 16:15:45 WARN: symbol analysis for README.md (3485 bytes) has been running for 44s
2023/11/03 16:16:00 DEBUG: symbol analysis still running for shard statistics: duration=1m0s symbols=958 bytes=624329
2023/11/03 16:16:00 symbol analysis for README.md (size 3485 bytes) is done and found 4 symbols
2023/11/03 16:16:06 symbol analysis finished for shard statistics: duration=1m5s symbols=142951 bytes=129180023
2023/11/03 16:16:10 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983299 index bytes (overhead 2.8), 14533 files processed
This change refactors our end-to-end scoring tests and enables local testing
using the scip-ctags binary:
* Split scoring tests out of `e2e_test` and into their own file `scoring_test`
* Split huge test methods into targeted ones like `TestFileNameMatch`,
`TestJava`, `TestGo`, etc.
* For languages that scip-ctags supports, rerun the same cases using the
scip-ctags binary
To run scip-ctags tests locally, you can set the env variable
```
SCIP_CTAGS_COMMAND=<sourcegraph-repo>/dev/scip-ctags-dev
```
This doesn't yet update Zoekt CI to run scip-ctags tests. That will be tackled
in a follow-up.
On large repos, indexing might take quite a while and hit the indexing timeout.
This change helps debug these situations:
* Make the indexing timeout configurable through an env variable
`INDEXING_TIMEOUT`
* Add more info to progress logging: log the total number of files being
indexed, plus the file count per shard
SCIP ctags can output different kind names than universal-ctags (for example
`typeAlias` instead of `talias`). This change makes sure we handle different
names for the same kind. To do so, it refactors the logic so we first match
strings to standard kinds, then decide how these are scored for each language.
That way, you don't need to remember to cover all the possible kind names each
time you adjust scoring for a new language.
Also added basic tests for Ruby and Python to ensure we don't accidentally
change the scoring.
Right now our symbol analyser doesn't tell us if a symbol is exported. We add
a go specific tweak here to boost those results. Ideally this could be
something that is encoded in the symbol information.
Additionally we do downrank _test.go files via the doc-order. But in the case
of symbol matches the boosting overweighs doc order signficantly. I found the
extra downraking quite useful when experimenting.
Test Plan: lots of manual testing on the keyword branch
Right now we boost a file extension that hasn't been seen to the 3rd
position. This is gated by an environment variable which defaults to
on. I want to explore if there are ways we can turn on this behaviour
with the query language.
Test Plan: go run ./cmd/zoekt foo
Tiny change to use a type def, as we usually prefer those over type aliases.
We bump go-ctags to include markdown files in symbol analysis. We
already have ranking behaviour for markdown files but due to go-ctags it
won't be taken into account (until this commit).
Additionally we bump the version included in the docker container to
v6.0.0.
Test Plan: CI and "docker build -t zoekt ."
We add a bit more complexity to addScore but avoid allocating a string if debugScore = false.
Test plan:
I ran a couple of benchmarks and confirmed we don't allocate.
Relates to #667.
This updates the expected scores, now that we have removed the repetition boost.
Tiny change to display the atom count. I caught myself several times trying to
translate the score to an atom count.
Test plan:
Ran it locally
I don't remember if there was a good reason to score methods and
functions differently, but I have found examples where the difference
between the two scores causes the better result to be lower ranked,
although it has better file-order boost.
Scip-ctags distinguishes between methods and functions, universal-ctags
does not.
Test plan:
updated score tests
The definition of how this is applied is very narrow and more often than
not works poorly. Originally I sent out a commit to dampen this using
log, but Julie suggested just removing which sounds better.
Test Plan: go test
Because "type" in scip-ctags matches so much we can't boost it as high
as we do for classes/structs in our language specific boosting. In the
case of GraphQL this is useful to boost.
Test Plan: A search like "type User" in the sourcegraph codebase now
includes results from the schema.graphql files.
Very useful for quick iteration on ranking work.
Test Plan: go run ./cmd/zoekt -debug foo
* temporarily use log branch
* logging: remove description param in Scoped
* use latest log and mountinfo versions
There are likely more that we are missing. We only briefly looked at
golang, likely other languages are also affected. The regression is
scip-ctags returning different values for kind.
Will follow-up with a more comprehensive workaround until scip-ctags
issues are resolved.
Test Plan: manually tested queries like "test server" against the golang
repo until the test server struct was the top result.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This field is no longer requested in Sourcegraph and has been deprecated
for a while.
Test Plan: go test ./...
Previously we enforced the binary was called universal-ctags. However,
we let users override the name of the binary, so if they override it we
should use it. To prevent footguns, we now validate the binary was built
with the interactive feature.
In the case of scip-ctags we use the old validation of just checking the
name.
Additionally we fix a bug that was introduced where if symbols are
optional we continue if parsing fails.
Test Plan: indexed with and without universal-ctags. Ctags on my mbp
points to something from xcode which wouldn't work
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:12 indexGitRepo(/Users/keegan/src/github.com/sourcegraph/zoekt, delta=false): build.NewBuilder: ctags.NewParserMap: ctags binary is not universal-ctags or is not compiled with +interactive feature: bin=ctags
exit status 1
$ CTAGS_COMMAND=universal-ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:29 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8657338 index bytes (overhead 2.9)
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index .
2023/10/04 17:08:40 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8538246 index bytes (overhead 2.9)
Right now we use panicf which leads a stack trace which is misleading at
what is happening and fills up the space used by kubernetes error
reporting. Additionally a few times we have had bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error.
Then ran the following
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3
This probably caused our tmp-dirs to run full. In production we saw data
from years ago taking up disk space.
Co-authored-by: Petri-Johan Last <petri.last@sourcegraph.com>
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
This was a nice to have when we investigated search times. A lot of the
stats we have don't really capture slow downs due to IO/etc, so a rough
time based metric will be useful to understand how much time we spend
actually searching vs deciding if we should search.
Test Plan: go test ./...
If the doc limit was triggered, the match limit was never evaluated,
even if it could have further reduced the FileMatches.
This moves all decisions about if we should do display limits and how to
enforce them into one place. By default we make it stateful so it can be
used by the limitSender. But in every other place we will always sort
and then truncate, so we also introduce a convenience API for that.
The only potential behaviour change here is if you have no display
limits but had document ranks turned on, we would not sort matches
within a shard. However, the aggregator would still do sorting anyways
so this is really just a wasted sort. In practice a display limit should
always be set.
Test Plan: go test
All clients of zoekt have a shared problem: they have no reliable way to
bound the size of the SearchResult. The primary dimension that
determines the size of a SearchResult is the number of matches. None of
the existing levers zoekt provides sufficiently limit this size:
- MaxDocDisplayCount is a hard limit on the number of Files in the
SearchResult. But when a single File can have an arbitrary number of
matches for the query, you can still end up with enormous
SearchResults when this parameter is 1.
The existing *MaxMatchCount parameters are more about limiting the
amount of work zoekt does when executing queries than they are about
limiting the response size:
- TotalMaxMatchCount is a soft limit on the number of matches
across shards. But it is only evaluated after handling each shard, so
if a single shard has an enormous number of matches, the SearchResult
will be enormous.
- ShardMaxMatchCount is a soft limit on the number of matches from a
single shard. But it is only evaluated after handling each document, so
if a single document has an enormous number of matches, the
SearchResult will be enormous.
- ShardRepoMaxMatchCount, well, you get the idea.
Different clients have a differing ability to tolerate enormous
SearchResults. Sourcegraph, for example, is apparently doing just fine;
they put hard limits on the number of matches in their own server, which
is itself a client of zoekt. They're presumably able to tolerate large
responses from zoekt as it's running colocated in a datacenter
environment.
But clients that are, for example, running in browsers, and using the
less-compact JSON-encoded API, are much less able to cope with enormous
SearchResults, which can be multiple megabytes large even with the most
conservative applications of the existing parameters.
Enter MaxMatchDisplayCount, which has similar semantics to
MaxDocDisplayCount, and is used by zoekt in the exact same places as
that parameter. With this, clients can get a much better handle on the
size of zoekt SearchResults.
In #187 we collapsed the spans which caused all the LazyPrintf in defer
statements to be dropped.
In this PR I revert #187, deduplicate trace logs (EG we logged "num
files" and `opts` in multiple places) and log `stats` in streaming
search.
Test plan:
I confirmed locally that the logs show up as expected.
* Create buf-breaking-check.yml
buf breaking test for grpc proto file to prevent checking in breaking changes and maintaining backward compatibility
Follow up to: https://github.com/sourcegraph/sourcegraph/pull/55131
* Update buf-breaking-check.yml
add workflow_dispatch: {}
* Update .github/workflows/buf-breaking-check.yml
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
This reverts commit b7e5070bfb563ea836e532dc524a733dd5d684c8. Initial
data from production shows we didn't improve performance so we are
reverting since the complicates without improving perf.
Test Plan: CI
The purpose of this commit is to reduce disk IO in case we skip a shard
because of missing ngrams.
To achieve this, we first check whether ALL ngrams exist in the shard before
loading the posting lists to determine their frequency. This means we have to
loop twice over the ngrams for the benefit of not loading any posting list in
case the shard would have been skipped anyways.
Test plan: This is a refactor, so relying on CI
The name always bothered me since we had fileNameNgrams as well. Now
they are both accessed via a btreeIndex, I also took the opportunity to
introduce a helper "ngrams" which returns the correct index depending on
if you want filenames or contents.
Test Plan: go test
This feature has been disabled by default for a long time.
Test Plan: CI
This step currently generates incomplete PRs since it hasn't been
updated to handle bazel yet. I'm unsure how to make it do that, so I'd
rather just remove it for now and add it back once we want to do that.
Test Plan: CI
Reintroducing this optimization which we lost when we sorted ngrams.
We believe this will improve performance of the btree lookups. We are
investigating this to make it faster to rule out a shard (when freq==0).
Testing locally on a large corpus we halved the time spent in IO.
Locally Sort shows up in the profiles significantly, but there are two
facts mitigating that:
- Locally my file page cache is primed so IO rarely is going to disk.
- We likely will implement an IR for Zoekt which will amortize the Sort
to once per search rather than once per shard.
Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Firstly we use bytes.IndexByte for faster newLinesIndices. On my machine this
reduces wall clock time of BenchmarkTagsToSections by 38%. This is faster
since bytes.IndexByte relies on CPU specific optimizations to find the next
new line (eg uses AVX2 if available).
Secondly we reuse nls slice between calls to tagsToSections. I noticed in the
profiler a nonsignificant chunk in the garbage collector. The slice built by
newLinesIndices is allocated and thrown away for each call to tagsToSections.
This means we can re-use it which this commit implements by introducing a
struct storing the buffer. We now use this buffer per shard of symbols we
analyse.
old time/op new time/op delta
188µs ± 7% 101µs ± 3% -46.10% (p=0.000 n=10+10)
old alloc/op new alloc/op delta
79.3kB ± 0% 36.3kB ± 0% -54.24% (p=0.000 n=9+10)
old allocs/op new allocs/op delta
443 ± 0% 441 ± 0% -0.45% (p=0.000 n=10+10)
Test Plan: go test -bench BenchmarkTagsToSections
This adds a monitor which will report every minute the progress of
symbol analysis. Additionally, if a document is taking too long to
analyse (10s) we report it.
At first this is just reporting via stdlog. However, once we are
comfortable with thresholds around this we can likely also include a way
to kill analysis for a file.
Test Plan: Adjusted monitorReportStatus to 1s then indexed the
sourcegraph repo and inspected the output
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:03:10 attempting to index 14533 total files
2023/11/03 16:03:13 DEBUG: symbol analysis still running for shard statistics: duration=1s symbols=15805 bytes=44288971
2023/11/03 16:03:14 DEBUG: symbol analysis still running for shard statistics: duration=2s symbols=26189 bytes=51564417
2023/11/03 16:03:15 DEBUG: symbol analysis still running for shard statistics: duration=3s symbols=55613 bytes=64748084
2023/11/03 16:03:16 DEBUG: symbol analysis still running for shard statistics: duration=4s symbols=86557 bytes=93771404
2023/11/03 16:03:17 DEBUG: symbol analysis still running for shard statistics: duration=5s symbols=125352 bytes=116319453
2023/11/03 16:03:18 symbol analysis finished for shard statistics: duration=5s symbols=142951 bytes=129180023
2023/11/03 16:03:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983298 index bytes (overhead 2.8), 14533 files processed
I then added a random sleep for a minute in a file to see the stuck
reporting:
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:14:57 attempting to index 14533 total files
2023/11/03 16:15:15 WARN: symbol analysis for README.md (3485 bytes) has been running for 14s
2023/11/03 16:15:25 WARN: symbol analysis for README.md (3485 bytes) has been running for 24s
2023/11/03 16:15:45 WARN: symbol analysis for README.md (3485 bytes) has been running for 44s
2023/11/03 16:16:00 DEBUG: symbol analysis still running for shard statistics: duration=1m0s symbols=958 bytes=624329
2023/11/03 16:16:00 symbol analysis for README.md (size 3485 bytes) is done and found 4 symbols
2023/11/03 16:16:06 symbol analysis finished for shard statistics: duration=1m5s symbols=142951 bytes=129180023
2023/11/03 16:16:10 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983299 index bytes (overhead 2.8), 14533 files processed
This change refactors our end-to-end scoring tests and enables local testing
using the scip-ctags binary:
* Split scoring tests out of `e2e_test` and into their own file `scoring_test`
* Split huge test methods into targeted ones like `TestFileNameMatch`,
`TestJava`, `TestGo`, etc.
* For languages that scip-ctags supports, rerun the same cases using the
scip-ctags binary
To run scip-ctags tests locally, you can set the env variable
```
SCIP_CTAGS_COMMAND=<sourcegraph-repo>/dev/scip-ctags-dev
```
This doesn't yet update Zoekt CI to run scip-ctags tests. That will be tackled
in a follow-up.
On large repos, indexing might take quite a while and hit the indexing timeout.
This change helps debug these situations:
* Make the indexing timeout configurable through an env variable
`INDEXING_TIMEOUT`
* Add more info to progress logging: log the total number of files being
indexed, plus the file count per shard
SCIP ctags can output different kind names than universal-ctags (for example
`typeAlias` instead of `talias`). This change makes sure we handle different
names for the same kind. To do so, it refactors the logic so we first match
strings to standard kinds, then decide how these are scored for each language.
That way, you don't need to remember to cover all the possible kind names each
time you adjust scoring for a new language.
Also added basic tests for Ruby and Python to ensure we don't accidentally
change the scoring.
Right now our symbol analyser doesn't tell us if a symbol is exported. We add
a go specific tweak here to boost those results. Ideally this could be
something that is encoded in the symbol information.
Additionally we do downrank _test.go files via the doc-order. But in the case
of symbol matches the boosting overweighs doc order signficantly. I found the
extra downraking quite useful when experimenting.
Test Plan: lots of manual testing on the keyword branch
We bump go-ctags to include markdown files in symbol analysis. We
already have ranking behaviour for markdown files but due to go-ctags it
won't be taken into account (until this commit).
Additionally we bump the version included in the docker container to
v6.0.0.
Test Plan: CI and "docker build -t zoekt ."
I don't remember if there was a good reason to score methods and
functions differently, but I have found examples where the difference
between the two scores causes the better result to be lower ranked,
although it has better file-order boost.
Scip-ctags distinguishes between methods and functions, universal-ctags
does not.
Test plan:
updated score tests
There are likely more that we are missing. We only briefly looked at
golang, likely other languages are also affected. The regression is
scip-ctags returning different values for kind.
Will follow-up with a more comprehensive workaround until scip-ctags
issues are resolved.
Test Plan: manually tested queries like "test server" against the golang
repo until the test server struct was the top result.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously we enforced the binary was called universal-ctags. However,
we let users override the name of the binary, so if they override it we
should use it. To prevent footguns, we now validate the binary was built
with the interactive feature.
In the case of scip-ctags we use the old validation of just checking the
name.
Additionally we fix a bug that was introduced where if symbols are
optional we continue if parsing fails.
Test Plan: indexed with and without universal-ctags. Ctags on my mbp
points to something from xcode which wouldn't work
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:12 indexGitRepo(/Users/keegan/src/github.com/sourcegraph/zoekt, delta=false): build.NewBuilder: ctags.NewParserMap: ctags binary is not universal-ctags or is not compiled with +interactive feature: bin=ctags
exit status 1
$ CTAGS_COMMAND=universal-ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:29 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8657338 index bytes (overhead 2.9)
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index .
2023/10/04 17:08:40 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8538246 index bytes (overhead 2.9)
Right now we use panicf which leads a stack trace which is misleading at
what is happening and fills up the space used by kubernetes error
reporting. Additionally a few times we have had bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error.
Then ran the following
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3
This moves all decisions about if we should do display limits and how to
enforce them into one place. By default we make it stateful so it can be
used by the limitSender. But in every other place we will always sort
and then truncate, so we also introduce a convenience API for that.
The only potential behaviour change here is if you have no display
limits but had document ranks turned on, we would not sort matches
within a shard. However, the aggregator would still do sorting anyways
so this is really just a wasted sort. In practice a display limit should
always be set.
Test Plan: go test
All clients of zoekt have a shared problem: they have no reliable way to
bound the size of the SearchResult. The primary dimension that
determines the size of a SearchResult is the number of matches. None of
the existing levers zoekt provides sufficiently limit this size:
- MaxDocDisplayCount is a hard limit on the number of Files in the
SearchResult. But when a single File can have an arbitrary number of
matches for the query, you can still end up with enormous
SearchResults when this parameter is 1.
The existing *MaxMatchCount parameters are more about limiting the
amount of work zoekt does when executing queries than they are about
limiting the response size:
- TotalMaxMatchCount is a soft limit on the number of matches
across shards. But it is only evaluated after handling each shard, so
if a single shard has an enormous number of matches, the SearchResult
will be enormous.
- ShardMaxMatchCount is a soft limit on the number of matches from a
single shard. But it is only evaluated after handling each document, so
if a single document has an enormous number of matches, the
SearchResult will be enormous.
- ShardRepoMaxMatchCount, well, you get the idea.
Different clients have a differing ability to tolerate enormous
SearchResults. Sourcegraph, for example, is apparently doing just fine;
they put hard limits on the number of matches in their own server, which
is itself a client of zoekt. They're presumably able to tolerate large
responses from zoekt as it's running colocated in a datacenter
environment.
But clients that are, for example, running in browsers, and using the
less-compact JSON-encoded API, are much less able to cope with enormous
SearchResults, which can be multiple megabytes large even with the most
conservative applications of the existing parameters.
Enter MaxMatchDisplayCount, which has similar semantics to
MaxDocDisplayCount, and is used by zoekt in the exact same places as
that parameter. With this, clients can get a much better handle on the
size of zoekt SearchResults.
In #187 we collapsed the spans which caused all the LazyPrintf in defer
statements to be dropped.
In this PR I revert #187, deduplicate trace logs (EG we logged "num
files" and `opts` in multiple places) and log `stats` in streaming
search.
Test plan:
I confirmed locally that the logs show up as expected.
* Create buf-breaking-check.yml
buf breaking test for grpc proto file to prevent checking in breaking changes and maintaining backward compatibility
Follow up to: https://github.com/sourcegraph/sourcegraph/pull/55131
* Update buf-breaking-check.yml
add workflow_dispatch: {}
* Update .github/workflows/buf-breaking-check.yml
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
The purpose of this commit is to reduce disk IO in case we skip a shard
because of missing ngrams.
To achieve this, we first check whether ALL ngrams exist in the shard before
loading the posting lists to determine their frequency. This means we have to
loop twice over the ngrams for the benefit of not loading any posting list in
case the shard would have been skipped anyways.
Test plan: This is a refactor, so relying on CI
We believe this will improve performance of the btree lookups. We are
investigating this to make it faster to rule out a shard (when freq==0).
Testing locally on a large corpus we halved the time spent in IO.
Locally Sort shows up in the profiles significantly, but there are two
facts mitigating that:
- Locally my file page cache is primed so IO rarely is going to disk.
- We likely will implement an IR for Zoekt which will amortize the Sort
to once per search rather than once per shard.
Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>