Commits
Right now we use panicf which leads a stack trace which is misleading at
what is happening and fills up the space used by kubernetes error
reporting. Additionally a few times we have had bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error.
Then ran the following
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3
This probably caused our tmp-dirs to run full. In production we saw data
from years ago taking up disk space.
Co-authored-by: Petri-Johan Last <petri.last@sourcegraph.com>
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
This was a nice to have when we investigated search times. A lot of the
stats we have don't really capture slow downs due to IO/etc, so a rough
time based metric will be useful to understand how much time we spend
actually searching vs deciding if we should search.
Test Plan: go test ./...
If the doc limit was triggered, the match limit was never evaluated,
even if it could have further reduced the FileMatches.
This moves all decisions about if we should do display limits and how to
enforce them into one place. By default we make it stateful so it can be
used by the limitSender. But in every other place we will always sort
and then truncate, so we also introduce a convenience API for that.
The only potential behaviour change here is if you have no display
limits but had document ranks turned on, we would not sort matches
within a shard. However, the aggregator would still do sorting anyways
so this is really just a wasted sort. In practice a display limit should
always be set.
Test Plan: go test
All clients of zoekt have a shared problem: they have no reliable way to
bound the size of the SearchResult. The primary dimension that
determines the size of a SearchResult is the number of matches. None of
the existing levers zoekt provides sufficiently limit this size:
- MaxDocDisplayCount is a hard limit on the number of Files in the
SearchResult. But when a single File can have an arbitrary number of
matches for the query, you can still end up with enormous
SearchResults when this parameter is 1.
The existing *MaxMatchCount parameters are more about limiting the
amount of work zoekt does when executing queries than they are about
limiting the response size:
- TotalMaxMatchCount is a soft limit on the number of matches
across shards. But it is only evaluated after handling each shard, so
if a single shard has an enormous number of matches, the SearchResult
will be enormous.
- ShardMaxMatchCount is a soft limit on the number of matches from a
single shard. But it is only evaluated after handling each document, so
if a single document has an enormous number of matches, the
SearchResult will be enormous.
- ShardRepoMaxMatchCount, well, you get the idea.
Different clients have a differing ability to tolerate enormous
SearchResults. Sourcegraph, for example, is apparently doing just fine;
they put hard limits on the number of matches in their own server, which
is itself a client of zoekt. They're presumably able to tolerate large
responses from zoekt as it's running colocated in a datacenter
environment.
But clients that are, for example, running in browsers, and using the
less-compact JSON-encoded API, are much less able to cope with enormous
SearchResults, which can be multiple megabytes large even with the most
conservative applications of the existing parameters.
Enter MaxMatchDisplayCount, which has similar semantics to
MaxDocDisplayCount, and is used by zoekt in the exact same places as
that parameter. With this, clients can get a much better handle on the
size of zoekt SearchResults.
In #187 we collapsed the spans which caused all the LazyPrintf in defer
statements to be dropped.
In this PR I revert #187, deduplicate trace logs (EG we logged "num
files" and `opts` in multiple places) and log `stats` in streaming
search.
Test plan:
I confirmed locally that the logs show up as expected.
* Create buf-breaking-check.yml
buf breaking test for grpc proto file to prevent checking in breaking changes and maintaining backward compatibility
Follow up to: https://github.com/sourcegraph/sourcegraph/pull/55131
* Update buf-breaking-check.yml
add workflow_dispatch: {}
* Update .github/workflows/buf-breaking-check.yml
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
This reverts commit b7e5070bfb563ea836e532dc524a733dd5d684c8. Initial
data from production shows we didn't improve performance so we are
reverting since the complicates without improving perf.
Test Plan: CI
The purpose of this commit is to reduce disk IO in case we skip a shard
because of missing ngrams.
To achieve this, we first check whether ALL ngrams exist in the shard before
loading the posting lists to determine their frequency. This means we have to
loop twice over the ngrams for the benefit of not loading any posting list in
case the shard would have been skipped anyways.
Test plan: This is a refactor, so relying on CI
The name always bothered me since we had fileNameNgrams as well. Now
they are both accessed via a btreeIndex, I also took the opportunity to
introduce a helper "ngrams" which returns the correct index depending on
if you want filenames or contents.
Test Plan: go test
This feature has been disabled by default for a long time.
Test Plan: CI
This step currently generates incomplete PRs since it hasn't been
updated to handle bazel yet. I'm unsure how to make it do that, so I'd
rather just remove it for now and add it back once we want to do that.
Test Plan: CI
Reintroducing this optimization which we lost when we sorted ngrams.
We believe this will improve performance of the btree lookups. We are
investigating this to make it faster to rule out a shard (when freq==0).
Testing locally on a large corpus we halved the time spent in IO.
Locally Sort shows up in the profiles significantly, but there are two
facts mitigating that:
- Locally my file page cache is primed so IO rarely is going to disk.
- We likely will implement an IR for Zoekt which will amortize the Sort
to once per search rather than once per shard.
Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
We have been running btree as the default for many months. We worried
about a performance hit, but it never happened. After some recent local
testing I did I noticed the btree actually interacted with the disk more
efficiently. So the old code both uses more memory and is slower, lets
just remove it.
Test Plan: go test ./...
Useful in local testing to capture both on and off cpu time spent.
Should consider shipping this in the webserver as well.
Test Plan: ran zoekt with -full_profile flag and inspected output in
pprof.
Test Plan: go test ./...
When the user specifies an invalid mirror config that contains an entry
without any of the GithubUser / GithubOrg / ... repository definitions,
zoekt-indexserver would crash, because `cmd` stays `nil` and causes a
SIGSEGV when it is later accessed in `main.executeMirror`.
Noticed we didn't include these fields in our prometheus and logging
integration.
Test Plan: go test ./...
The variable is `Version`, not `Branch`, and if we're using the template
variable syntax, it should have the leading dot that it needs to.
This instruments our matchiter code to track how often is calls Get on
the underlying ngram index. We have some queries on long strings which
are going much slower than expected and we suspect it is related to the
work done here.
This change requires a few moving pieces:
- Adding the int to Stats
- Instrumenting places we log/emit Stats
- grpc support
- capturing this work in matchiter code
- updating tests to ensure correctness
The last two points are the meat of this commit, while the rest was
mechanical work. I reworked the TestAndSearch since into TestSearchStats
since it was the only test which had useful assertions on Stats.
I was surprised to see how high the ngram lookup values were for such
simple queries. This makes me more confident in improving our
performance here.
Test Plan: go test
We now call updateStats upto twice per shard search. The intention of
this is to capture statistics before pruning the matchtree. Previously
we would of done work in creating a matchtree but would then prune those
items away and would then never capture those statistics.
In practice that work was reading just one or two varint (the size of a
posting list) so likely had minimal impact on the reported statistics.
However, in the next commit we want to introduce a statistic which is
recorded even if we generate a noMatchTree.
The main technical part of this commit is ensuring all existing
updateStats functions can be called twice without overcounting.
Test Plan: go test
Instead of checking stdout, we check the combined output. By default, git fetch writes to stderr.
Previously this was only populated by the web package for serving a
specific endpoint. We now populate it in all places we do aggregation.
Test Plan: updated unit tests. Additionally loaded zoekt-webserver then
added to the index a repository which required 2 shards to index. The
repository count shown only increased by 1.
I think this is our older ranking stuff.
Test Plan: go test
This would of caught the nil panic which was fixed in the last commit.
Test Plan: go test
We missed this path in the last PR, since we assumed we did an RPC to
get this information. This change has a slightly larger blast radius
because indexserver needs to read back the index file to find out the
index time. Previously it just used inputs to the index when informing
sourcegraph.
Test Plan: pretty robust unit tests here.
SG CI depends on the docker images being present. Now that our docker
images take longer to build with scip the race is easier to hit.
Test Plan: monitor main
We want this information in Sourcegraph. This turned into a bit more of
a hairy change due to our custom marshalling (for perf) and grpc
validation.
Test Plan: lots of unit test coverage
* Add scip-ctags to docker
* musl-ify
* Hardcode commit
* Pin dockerfile git commit
* Use archive
* Bump commit
This expands the debug output for keyword scoring to include the main elements of the score. This will help debug some surprising behavior where scores are quite different on dot com vs. S2.
This adds a gRPC API alongside the existing gob API.
It is enabled whenever the RPC setting is enabled. I didn't think it made sense to have a separate setting to enable it. It is only used when a gRPC request is detected (Content-Type: application/grpc). Eventually, we should likely open a separate port for gRPC traffic, but this should be okay for now.
In order to minimize the footprint of this change, we only use the protobuf definitions in the RPC layer. They are translated to/from the existing go types. This results in a small perf penalty.
* Add alternate ctags parser and language map
* Fix minor issues
* Lint protobuf
* Fix nil access
* Remove debug statements
* Small fixes
* ctagsAddSymbols -> ctagsAddSymbolsParser
* Split out parser logic
* Fix reviews
We never test or use this code path. I am also unaware of any users of
this code. If we hear complaints I'd recommend adding this code back in
a way that conforms to the parser interface.
Test Plan: go test
This PR adds an experimental option `UseKeywordScoring` for scoring file matches using [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/), the most common approach in keyword search. It treats each match in a file as a term and uses term frequencies to compute an approximation to BM25.
For now, it makes several simplifications:
* We drop inverse document frequency (idf) from the formula, since we don't have access to global term statistics
* There is no special handling for case sensitivity, filename matches, or symbols
* It ignores all other scoring signals, since BM25 is not normalized and it can be hard to combine it with other signals
Right now we use panicf which leads a stack trace which is misleading at
what is happening and fills up the space used by kubernetes error
reporting. Additionally a few times we have had bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error.
Then ran the following
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3
This moves all decisions about if we should do display limits and how to
enforce them into one place. By default we make it stateful so it can be
used by the limitSender. But in every other place we will always sort
and then truncate, so we also introduce a convenience API for that.
The only potential behaviour change here is if you have no display
limits but had document ranks turned on, we would not sort matches
within a shard. However, the aggregator would still do sorting anyways
so this is really just a wasted sort. In practice a display limit should
always be set.
Test Plan: go test
All clients of zoekt have a shared problem: they have no reliable way to
bound the size of the SearchResult. The primary dimension that
determines the size of a SearchResult is the number of matches. None of
the existing levers zoekt provides sufficiently limit this size:
- MaxDocDisplayCount is a hard limit on the number of Files in the
SearchResult. But when a single File can have an arbitrary number of
matches for the query, you can still end up with enormous
SearchResults when this parameter is 1.
The existing *MaxMatchCount parameters are more about limiting the
amount of work zoekt does when executing queries than they are about
limiting the response size:
- TotalMaxMatchCount is a soft limit on the number of matches
across shards. But it is only evaluated after handling each shard, so
if a single shard has an enormous number of matches, the SearchResult
will be enormous.
- ShardMaxMatchCount is a soft limit on the number of matches from a
single shard. But it is only evaluated after handling each document, so
if a single document has an enormous number of matches, the
SearchResult will be enormous.
- ShardRepoMaxMatchCount, well, you get the idea.
Different clients have a differing ability to tolerate enormous
SearchResults. Sourcegraph, for example, is apparently doing just fine;
they put hard limits on the number of matches in their own server, which
is itself a client of zoekt. They're presumably able to tolerate large
responses from zoekt as it's running colocated in a datacenter
environment.
But clients that are, for example, running in browsers, and using the
less-compact JSON-encoded API, are much less able to cope with enormous
SearchResults, which can be multiple megabytes large even with the most
conservative applications of the existing parameters.
Enter MaxMatchDisplayCount, which has similar semantics to
MaxDocDisplayCount, and is used by zoekt in the exact same places as
that parameter. With this, clients can get a much better handle on the
size of zoekt SearchResults.
In #187 we collapsed the spans which caused all the LazyPrintf in defer
statements to be dropped.
In this PR I revert #187, deduplicate trace logs (EG we logged "num
files" and `opts` in multiple places) and log `stats` in streaming
search.
Test plan:
I confirmed locally that the logs show up as expected.
* Create buf-breaking-check.yml
buf breaking test for grpc proto file to prevent checking in breaking changes and maintaining backward compatibility
Follow up to: https://github.com/sourcegraph/sourcegraph/pull/55131
* Update buf-breaking-check.yml
add workflow_dispatch: {}
* Update .github/workflows/buf-breaking-check.yml
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
The purpose of this commit is to reduce disk IO in case we skip a shard
because of missing ngrams.
To achieve this, we first check whether ALL ngrams exist in the shard before
loading the posting lists to determine their frequency. This means we have to
loop twice over the ngrams for the benefit of not loading any posting list in
case the shard would have been skipped anyways.
Test plan: This is a refactor, so relying on CI
We believe this will improve performance of the btree lookups. We are
investigating this to make it faster to rule out a shard (when freq==0).
Testing locally on a large corpus we halved the time spent in IO.
Locally Sort shows up in the profiles significantly, but there are two
facts mitigating that:
- Locally my file page cache is primed so IO rarely is going to disk.
- We likely will implement an IR for Zoekt which will amortize the Sort
to once per search rather than once per shard.
Test Plan: go test ./... and performance profiling via via ./cmd/zoekt.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
We have been running btree as the default for many months. We worried
about a performance hit, but it never happened. After some recent local
testing I did I noticed the btree actually interacted with the disk more
efficiently. So the old code both uses more memory and is slower, lets
just remove it.
Test Plan: go test ./...
This instruments our matchiter code to track how often is calls Get on
the underlying ngram index. We have some queries on long strings which
are going much slower than expected and we suspect it is related to the
work done here.
This change requires a few moving pieces:
- Adding the int to Stats
- Instrumenting places we log/emit Stats
- grpc support
- capturing this work in matchiter code
- updating tests to ensure correctness
The last two points are the meat of this commit, while the rest was
mechanical work. I reworked the TestAndSearch since into TestSearchStats
since it was the only test which had useful assertions on Stats.
I was surprised to see how high the ngram lookup values were for such
simple queries. This makes me more confident in improving our
performance here.
Test Plan: go test
We now call updateStats upto twice per shard search. The intention of
this is to capture statistics before pruning the matchtree. Previously
we would of done work in creating a matchtree but would then prune those
items away and would then never capture those statistics.
In practice that work was reading just one or two varint (the size of a
posting list) so likely had minimal impact on the reported statistics.
However, in the next commit we want to introduce a statistic which is
recorded even if we generate a noMatchTree.
The main technical part of this commit is ensuring all existing
updateStats functions can be called twice without overcounting.
Test Plan: go test
Previously this was only populated by the web package for serving a
specific endpoint. We now populate it in all places we do aggregation.
Test Plan: updated unit tests. Additionally loaded zoekt-webserver then
added to the index a repository which required 2 shards to index. The
repository count shown only increased by 1.
We missed this path in the last PR, since we assumed we did an RPC to
get this information. This change has a slightly larger blast radius
because indexserver needs to read back the index file to find out the
index time. Previously it just used inputs to the index when informing
sourcegraph.
Test Plan: pretty robust unit tests here.
This adds a gRPC API alongside the existing gob API.
It is enabled whenever the RPC setting is enabled. I didn't think it made sense to have a separate setting to enable it. It is only used when a gRPC request is detected (Content-Type: application/grpc). Eventually, we should likely open a separate port for gRPC traffic, but this should be okay for now.
In order to minimize the footprint of this change, we only use the protobuf definitions in the RPC layer. They are translated to/from the existing go types. This results in a small perf penalty.
This PR adds an experimental option `UseKeywordScoring` for scoring file matches using [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/), the most common approach in keyword search. It treats each match in a file as a term and uses term frequencies to compute an approximation to BM25.
For now, it makes several simplifications:
* We drop inverse document frequency (idf) from the formula, since we don't have access to global term statistics
* There is no special handling for case sensitivity, filename matches, or symbols
* It ignores all other scoring signals, since BM25 is not normalized and it can be hard to combine it with other signals