Commits
Test Plan: trivy fs go.mod runs clean
The new metric `index_indexing_delay_seconds` is looking good on dot com and
producing more interpretable results. This PR removes the old metric
`index_queue_age_seconds`. And it increases the bucket range for the new
metric, since we sometimes exceed the current 2 day limit.
`Queue` is always created through `NewQueue`, which always initializes the data
structures. So we don't need to initialize them in other methods.
The `sourcegraph/sourcegraph` repo is now private, so we switch to
`sourcegraph-public-snapshot` for these tests. Eventually, we'll move the
relevant tests to the sourcegraph repo itself.
Our dashboards currently report indexing delay using the
`index_queue_age_seconds` metric, which tracks the duration indexing jobs wait
in the queue before being "popped" for indexing. For dot com, we regularly see
the median for this metric at 2 days. However, this metric may be misleading
because it includes "noop" indexing jobs where the index is already up to date.
These jobs can stay in the queue for a long time since they are not high
priority.
This PR adds a new metric `index_indexing_delay_seconds`, which reports the
duration from when an indexing job enters the queue, to when it completes. It
uses 'state' as a label, so by setting `state='success'` we can (roughly) see
the delay from when Zoekt learns about a new version, to when that version is
available to search.
If `maxFreq=0` then `rand.Intn` will panic. This PR ensures it is always >= 1.
Example failure: https://github.com/sourcegraph/zoekt/actions/runs/10301894396/job/28514416201?pr=804
This is an optional dependency that I don't believe OSS users of docker
use. We used to use these dockerfiles in Sourcegraph, but have since
moved to building it in our own CI with scip-ctags.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
Remove the dependency on the sourcegraph monorepo
Any write to the index dir triggered a scan. This means on busy
instances we are constantly rescanning, leading to an
over-representation in CPU profiles around watch. The events are
normally writes to our temporary files. By only considering events for
.zoekt files (which is what scan reads) we can avoid the constant scan
calls.
Just in case we also introduce a re-scan every minute in case we miss an
event. There is error handling around this, but I thought it is just
more reliable to call scan every once in a while.
Note: this doesn't represent significant CPU use, but it does muddy the
CPU profiler output. So this makes it easier to understand trends in our
continuous cpu profiling.
Test Plan: CI
This enables shard merging by default for zoekt-sourcegraph-indexserver.
Sourcegraph has been using shard merging in production for several years. We have recently confirmed significant performance improvements for queries which are bound by matchTree construction.
I also remove -merge_max_priority because we have stopped using it.
Use SRC_DISABLE_SHARD_MERGING to disable shard merging.
Test plan:
mostly CI, I did some manual testing to confirm that shard merging is enabled by default for zoekt-sourcegraph-indexserver.
* feat(Search): Add support for all Apex language extensions
* clean up comment
* Fix typo
After defaulting to shard merging for all inactive repos, this in fact
makes searches slightly slower. So we can remove the experiment.
Test Plan: go test
The first bit of data I am getting back indicates this strategy of
limiting the number of ngrams we lookup isn't working. I am still
experimenting with different limits, but in the meantime it is easy to
implement a strategy which picks a random subset. This is so that the
first N ngrams of a query aren't the only ones being consulted.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
This introduces an experiment where we can stop looking up ngrams at a
certain limit. The insight here is that for large substrings we spend
more time finding the smallest ngram frequency than the time a normal
search takes. So instead we can try and find a good balance between
looking for a good (two) ngrams and actually searching the corpus.
The plan is to set different values for
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in sourcegraph production and
see how it affects performance of attribution search service.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
Updates grafana/regexp to the latest version of the speedup branch. This
incorporates the changes from upstream go that have happened.
Test Plan: go test ./...
Add fallbacks for languages not supported yet by linguist or go-enry
So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.
Here we calulate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.
Notes:
Also fixed an off-by-1 bug with how we count documents.
Test plan:
- Updated unit test
- Context evaluation results are slightly worse with a decrease from 64/89 to 63/89
* Update VSCode launch.json to enable easier debugging
* Add index path
As described, we'd like this information to be externally available to
drive metrics for incremental indexing.
Closes #780.
When we introduced filename boosting in BM25, we set it to a very conservative
weight. This PR increases the weight from 2.0 -> 5.0, which improves results on
relevant evals.
Relates to SPLF-88
Follow up to #779. This PR removes the logic for trigrams with the same
frequency, because it will no longer have a big effect.
We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.
It's confusing to call this `UseKeywordScoring`, since we do not use it for the
`keyword` patterntype in Sourcegraph. This commit clarifies the name to mention
BM25.
The current configuration creates a conflict with default
fetch config `+refs/heads/*:refs/heads/*`.
The first fetch is ok but once a local "refs/heads/meta/config" exists
fetch fails because 2 fetch rules match for local "refs/heads/meta/config" :
"refs/meta/config" and "refs/heads/meta/config"
Changing local ref name fixes the problem
This is so that when authentication details change we correctly update
the CloneURL for repositories managed by our mirror commands.
While working on ranking, I noticed that sum-tf is wrong if we have filename and content matches.
We use `finalCands` in our BM25 scoring, however, `finalCands` is modified in `fillChunkMatches` and `fillMatches` which can lead to surprising scores.
Test plan:
updated unit test
* Honor regex flags for case-sensitivity
* change
* add test
* pr comment
The repo is indexed only with the reponame which is more user-friendly
in the search when there is onely one host
Filter string will be "r:my-repo" instead of "r:my-gerrit-sever.com/my-repo"
This allows indexing repos that are used for development and therefore have an
SSH 'origin' URL.
After this update "trivy fs go.mod" has a clean report.
go get golang.org/x/net@v0.23.0 google.golang.org/protobuf@v1.33.0
Test Plan: go test
The images produced are no longer consumed directly by sourcegraph so we
don't have security scanners running on them. Additionally we upgraded
to the latest alpine now. Both combined means we can simplify the apk
add lines.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
gitindex: include environ for git tests
A recent change introduced GIT_CONFIG_GLOBAL and GIT_CONFIG_SYSTEM, but
ran into the footgun of unsetting the rest of the environ when cmd.Env
is non-empty. This for example broke tests for me since we didn't have
my custom $PATH set.
Instead we revert the change and make a much smaller change to just
always set the relevant environment variables.
This reverts commit ab1b8f09199ea7a731fd80876fc5b0f376ff6882.
Test Plan: go test
We now only consume zoekt via gRPC at Sourcegraph and I doubt anyone
uses the old endpoints.
This will have one required update in sourcegraph, and that is to use
SenderFunc from the main zoekt package rather than from the now deleted
stream package.
go.mod requires go1.22. This should fix the build step on main.
Test Plan: monitor CI
The updates our "line model" to fix the edge cases that led to sourcegraph/sourcegraph#60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).
Before this, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).
The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.
So, instead, this redefines a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.
BM25 considers a file to be a better match when there are many occurrences of
terms in the file. It's important to count all term occurrences, including
those in other fields like the filename.
For historical reasons, Zoekt trims all filename matches from a result if there
are any content matches. This meant that in BM25 scoring, we didn't account for
filename matches.
This PR refactors the match code so that we only trim filename matches when
assembling the final `FileMatch`. We retain filename matches when creating
`candidateMatch`, which lets BM25 scoring use them. Even without the better
BM25 scoring, I think this refactor makes the code easier to follow.
As part of our context quality work, I plan to update BM25 scoring so it takes
into account both filename and content matches. (Right now we ignore all
filename matches if there are any content matches.)
This is a preliminary refactor in the scoring code:
* Pull scoring logic into its own file `score.go`, since `eval.go` is super
long
* Make `indexData.Search` a bit shorter and add comments
We replace all calls to Regexp.String with a vendored version which is
faster.
go1.22 introduced a commit which "minimizes" the string returned by
Regexp.String(). Part of what it does is run enumerate through literals
runes in your string to see calculate flags related to unicode and case
sensitivity. This can be quite slow, but is made worse by the fact we
call it per shard per regexp in your query.Q to construct the matchtree.
Currently Regexp.String() represents 40% of CPU time on sourcegraph.com.
Before go1.22 it was ~0%.
Note: This is a temporary change to resolve the issue. I have a deeper
change to make this less clumsy.
Note: In one place we remove the use of string by relying on
Regexp.Equal instead.
Test Plan: go test
This reverts commit 931974dd953e07ce8fbe28e3b615323c2d156e40.
After monitoring production we are seeing increased latencies in List
calls. Additionally we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now until further investigation.
Test Plan: go test
The previous commit which added support for selectRepoSet in List was
ineffective since the queries we get for List do not look like (and ...)
but instead directly specify the reposet atom.
This commit adds support for it. Additionally to help with our manual
testing we add support for "repo:" query in the selectRepoSet
optimization.
While pairing on this we decided to improve the debug output to tracing
and rename the misnamed query.Q variable in List from r to q.
Test Plan: updated the test case to directly test the sort of queries we
get in List calls. Additionally ran "zoekt-webserver -pprof" and
observed in net/trace output that selectRepoSet optimization reduced the
number of searched shards.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency (a large buffered channel) which shows up in
the profiler as 30% of CPU spent on chan_ related operations under List.
This commit follows how Search used to respect proc.Yield. See sched.go
in 90ed7bfd07e0a3137e3ee627cbf6824446df9c4d. We did not copy Search
since it uses a more complicated implementation than we need since it
supports streaming, while List is still batch only.
We needed to use errgroup to ensure we drained all channels in the case
of an error. Previously we did not need to do this since the channels
had a buffer size of len(shards), which gaurenteed nothing would ever
block. Now channels are never larger than the number of workers (<=
GOMAXPROCS).
Test Plan: go test covers the no error cases. In the case of errors we
manually tested by running zoekt-webserver and adding a random context
cancellation. We observed the error being reported and no List
goroutines running.
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
Currently we have unindexed search asking Zoekt what commit it has for a
very specific repository. In normal search we use selectRepoSet to avoid
searching shards which are unrelated to the query. This uses the same
optimization in List.
Test Plan: go test
There was no test coverage, so I added some.
Closes #683.
In go 1.22 the way go converts a query like .* into a String changed.
This meant that when we compiled zoekt in the sourcegraph codebase our
"regex matches all" optimization stopped working.
This commit moves us to a simple inspection of the syntax tree which
will be stable across versions and to be honest just feels better.
To do this we introduce a field "origRegexp" to the regexpMatchTree so
we can avoid needing to recompile the syntax tree. Note: Regexp does not
offer a way to get at its internal syntax.Regexp which would be nice.
Test Plan: go test and added a unit test
Our CI pipeline uses apline:edge which now ship with 1.22. I noticed
that matchree_test.go doesn't pass anymore. There seem to have been
changes in regex package which we catch in our tests.
test plan:
go test ./... and CI
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
Previous commit used go1.22.0 but in the sourcegraph repo we are using
1.22.1.
Our dashboards currently report indexing delay using the
`index_queue_age_seconds` metric, which tracks the duration indexing jobs wait
in the queue before being "popped" for indexing. For dot com, we regularly see
the median for this metric at 2 days. However, this metric may be misleading
because it includes "noop" indexing jobs where the index is already up to date.
These jobs can stay in the queue for a long time since they are not high
priority.
This PR adds a new metric `index_indexing_delay_seconds`, which reports the
duration from when an indexing job enters the queue, to when it completes. It
uses 'state' as a label, so by setting `state='success'` we can (roughly) see
the delay from when Zoekt learns about a new version, to when that version is
available to search.
This is an optional dependency that I don't believe OSS users of docker
use. We used to use these dockerfiles in Sourcegraph, but have since
moved to building it in our own CI with scip-ctags.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
Any write to the index dir triggered a scan. This means on busy
instances we are constantly rescanning, leading to an
over-representation in CPU profiles around watch. The events are
normally writes to our temporary files. By only considering events for
.zoekt files (which is what scan reads) we can avoid the constant scan
calls.
Just in case we also introduce a re-scan every minute in case we miss an
event. There is error handling around this, but I thought it is just
more reliable to call scan every once in a while.
Note: this doesn't represent significant CPU use, but it does muddy the
CPU profiler output. So this makes it easier to understand trends in our
continuous cpu profiling.
Test Plan: CI
This enables shard merging by default for zoekt-sourcegraph-indexserver.
Sourcegraph has been using shard merging in production for several years. We have recently confirmed significant performance improvements for queries which are bound by matchTree construction.
I also remove -merge_max_priority because we have stopped using it.
Use SRC_DISABLE_SHARD_MERGING to disable shard merging.
Test plan:
mostly CI, I did some manual testing to confirm that shard merging is enabled by default for zoekt-sourcegraph-indexserver.
The first bit of data I am getting back indicates this strategy of
limiting the number of ngrams we lookup isn't working. I am still
experimenting with different limits, but in the meantime it is easy to
implement a strategy which picks a random subset. This is so that the
first N ngrams of a query aren't the only ones being consulted.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
This introduces an experiment where we can stop looking up ngrams at a
certain limit. The insight here is that for large substrings we spend
more time finding the smallest ngram frequency than the time a normal
search takes. So instead we can try and find a good balance between
looking for a good (two) ngrams and actually searching the corpus.
The plan is to set different values for
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in sourcegraph production and
see how it affects performance of attribution search service.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.
Here we calulate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.
Notes:
Also fixed an off-by-1 bug with how we count documents.
Test plan:
- Updated unit test
- Context evaluation results are slightly worse with a decrease from 64/89 to 63/89
We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.
The current configuration creates a conflict with default
fetch config `+refs/heads/*:refs/heads/*`.
The first fetch is ok but once a local "refs/heads/meta/config" exists
fetch fails because 2 fetch rules match for local "refs/heads/meta/config" :
"refs/meta/config" and "refs/heads/meta/config"
Changing local ref name fixes the problem
The images produced are no longer consumed directly by sourcegraph so we
don't have security scanners running on them. Additionally we upgraded
to the latest alpine now. Both combined means we can simplify the apk
add lines.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
gitindex: include environ for git tests
A recent change introduced GIT_CONFIG_GLOBAL and GIT_CONFIG_SYSTEM, but
ran into the footgun of unsetting the rest of the environ when cmd.Env
is non-empty. This for example broke tests for me since we didn't have
my custom $PATH set.
Instead we revert the change and make a much smaller change to just
always set the relevant environment variables.
This reverts commit ab1b8f09199ea7a731fd80876fc5b0f376ff6882.
Test Plan: go test
The updates our "line model" to fix the edge cases that led to sourcegraph/sourcegraph#60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).
Before this, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).
The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.
So, instead, this redefines a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.
BM25 considers a file to be a better match when there are many occurrences of
terms in the file. It's important to count all term occurrences, including
those in other fields like the filename.
For historical reasons, Zoekt trims all filename matches from a result if there
are any content matches. This meant that in BM25 scoring, we didn't account for
filename matches.
This PR refactors the match code so that we only trim filename matches when
assembling the final `FileMatch`. We retain filename matches when creating
`candidateMatch`, which lets BM25 scoring use them. Even without the better
BM25 scoring, I think this refactor makes the code easier to follow.
As part of our context quality work, I plan to update BM25 scoring so it takes
into account both filename and content matches. (Right now we ignore all
filename matches if there are any content matches.)
This is a preliminary refactor in the scoring code:
* Pull scoring logic into its own file `score.go`, since `eval.go` is super
long
* Make `indexData.Search` a bit shorter and add comments
We replace all calls to Regexp.String with a vendored version which is
faster.
go1.22 introduced a commit which "minimizes" the string returned by
Regexp.String(). Part of what it does is run enumerate through literals
runes in your string to see calculate flags related to unicode and case
sensitivity. This can be quite slow, but is made worse by the fact we
call it per shard per regexp in your query.Q to construct the matchtree.
Currently Regexp.String() represents 40% of CPU time on sourcegraph.com.
Before go1.22 it was ~0%.
Note: This is a temporary change to resolve the issue. I have a deeper
change to make this less clumsy.
Note: In one place we remove the use of string by relying on
Regexp.Equal instead.
Test Plan: go test
This reverts commit 931974dd953e07ce8fbe28e3b615323c2d156e40.
After monitoring production we are seeing increased latencies in List
calls. Additionally we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now until further investigation.
Test Plan: go test
The previous commit which added support for selectRepoSet in List was
ineffective since the queries we get for List do not look like (and ...)
but instead directly specify the reposet atom.
This commit adds support for it. Additionally to help with our manual
testing we add support for "repo:" query in the selectRepoSet
optimization.
While pairing on this we decided to improve the debug output to tracing
and rename the misnamed query.Q variable in List from r to q.
Test Plan: updated the test case to directly test the sort of queries we
get in List calls. Additionally ran "zoekt-webserver -pprof" and
observed in net/trace output that selectRepoSet optimization reduced the
number of searched shards.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency (a large buffered channel) which shows up in
the profiler as 30% of CPU spent on chan_ related operations under List.
This commit follows how Search used to respect proc.Yield. See sched.go
in 90ed7bfd07e0a3137e3ee627cbf6824446df9c4d. We did not copy Search
since it uses a more complicated implementation than we need since it
supports streaming, while List is still batch only.
We needed to use errgroup to ensure we drained all channels in the case
of an error. Previously we did not need to do this since the channels
had a buffer size of len(shards), which gaurenteed nothing would ever
block. Now channels are never larger than the number of workers (<=
GOMAXPROCS).
Test Plan: go test covers the no error cases. In the case of errors we
manually tested by running zoekt-webserver and adding a random context
cancellation. We observed the error being reported and no List
goroutines running.
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
In go 1.22 the way go converts a query like .* into a String changed.
This meant that when we compiled zoekt in the sourcegraph codebase our
"regex matches all" optimization stopped working.
This commit moves us to a simple inspection of the syntax tree which
will be stable across versions and to be honest just feels better.
To do this we introduce a field "origRegexp" to the regexpMatchTree so
we can avoid needing to recompile the syntax tree. Note: Regexp does not
offer a way to get at its internal syntax.Regexp which would be nice.
Test Plan: go test and added a unit test