Commits
This introduces an experiment where we can stop looking up ngrams at a
certain limit. The insight here is that for large substrings we spend
more time finding the smallest ngram frequency than the time a normal
search takes. So instead we can try and find a good balance between
looking for a good (two) ngrams and actually searching the corpus.
The plan is to set different values for
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in sourcegraph production and
see how it affects performance of attribution search service.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
Updates grafana/regexp to the latest version of the speedup branch. This
incorporates the changes from upstream go that have happened.
Test Plan: go test ./...
Add fallbacks for languages not supported yet by linguist or go-enry
So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.
Here we calulate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.
Notes:
Also fixed an off-by-1 bug with how we count documents.
Test plan:
- Updated unit test
- Context evaluation results are slightly worse with a decrease from 64/89 to 63/89
* Update VSCode launch.json to enable easier debugging
* Add index path
As described, we'd like this information to be externally available to
drive metrics for incremental indexing.
Closes #780.
When we introduced filename boosting in BM25, we set it to a very conservative
weight. This PR increases the weight from 2.0 -> 5.0, which improves results on
relevant evals.
Relates to SPLF-88
Follow up to #779. This PR removes the logic for trigrams with the same
frequency, because it will no longer have a big effect.
We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.
It's confusing to call this `UseKeywordScoring`, since we do not use it for the
`keyword` patterntype in Sourcegraph. This commit clarifies the name to mention
BM25.
The current configuration creates a conflict with default
fetch config `+refs/heads/*:refs/heads/*`.
The first fetch is ok but once a local "refs/heads/meta/config" exists
fetch fails because 2 fetch rules match for local "refs/heads/meta/config" :
"refs/meta/config" and "refs/heads/meta/config"
Changing local ref name fixes the problem
This is so that when authentication details change we correctly update
the CloneURL for repositories managed by our mirror commands.
While working on ranking, I noticed that sum-tf is wrong if we have filename and content matches.
We use `finalCands` in our BM25 scoring, however, `finalCands` is modified in `fillChunkMatches` and `fillMatches` which can lead to surprising scores.
Test plan:
updated unit test
* Honor regex flags for case-sensitivity
* change
* add test
* pr comment
The repo is indexed only with the reponame which is more user-friendly
in the search when there is onely one host
Filter string will be "r:my-repo" instead of "r:my-gerrit-sever.com/my-repo"
This allows indexing repos that are used for development and therefore have an
SSH 'origin' URL.
After this update "trivy fs go.mod" has a clean report.
go get golang.org/x/net@v0.23.0 google.golang.org/protobuf@v1.33.0
Test Plan: go test
The images produced are no longer consumed directly by sourcegraph so we
don't have security scanners running on them. Additionally we upgraded
to the latest alpine now. Both combined means we can simplify the apk
add lines.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
gitindex: include environ for git tests
A recent change introduced GIT_CONFIG_GLOBAL and GIT_CONFIG_SYSTEM, but
ran into the footgun of unsetting the rest of the environ when cmd.Env
is non-empty. This for example broke tests for me since we didn't have
my custom $PATH set.
Instead we revert the change and make a much smaller change to just
always set the relevant environment variables.
This reverts commit ab1b8f09199ea7a731fd80876fc5b0f376ff6882.
Test Plan: go test
We now only consume zoekt via gRPC at Sourcegraph and I doubt anyone
uses the old endpoints.
This will have one required update in sourcegraph, and that is to use
SenderFunc from the main zoekt package rather than from the now deleted
stream package.
go.mod requires go1.22. This should fix the build step on main.
Test Plan: monitor CI
The updates our "line model" to fix the edge cases that led to sourcegraph/sourcegraph#60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).
Before this, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).
The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.
So, instead, this redefines a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.
BM25 considers a file to be a better match when there are many occurrences of
terms in the file. It's important to count all term occurrences, including
those in other fields like the filename.
For historical reasons, Zoekt trims all filename matches from a result if there
are any content matches. This meant that in BM25 scoring, we didn't account for
filename matches.
This PR refactors the match code so that we only trim filename matches when
assembling the final `FileMatch`. We retain filename matches when creating
`candidateMatch`, which lets BM25 scoring use them. Even without the better
BM25 scoring, I think this refactor makes the code easier to follow.
As part of our context quality work, I plan to update BM25 scoring so it takes
into account both filename and content matches. (Right now we ignore all
filename matches if there are any content matches.)
This is a preliminary refactor in the scoring code:
* Pull scoring logic into its own file `score.go`, since `eval.go` is super
long
* Make `indexData.Search` a bit shorter and add comments
We replace all calls to Regexp.String with a vendored version which is
faster.
go1.22 introduced a commit which "minimizes" the string returned by
Regexp.String(). Part of what it does is run enumerate through literals
runes in your string to see calculate flags related to unicode and case
sensitivity. This can be quite slow, but is made worse by the fact we
call it per shard per regexp in your query.Q to construct the matchtree.
Currently Regexp.String() represents 40% of CPU time on sourcegraph.com.
Before go1.22 it was ~0%.
Note: This is a temporary change to resolve the issue. I have a deeper
change to make this less clumsy.
Note: In one place we remove the use of string by relying on
Regexp.Equal instead.
Test Plan: go test
This reverts commit 931974dd953e07ce8fbe28e3b615323c2d156e40.
After monitoring production we are seeing increased latencies in List
calls. Additionally we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now until further investigation.
Test Plan: go test
The previous commit which added support for selectRepoSet in List was
ineffective since the queries we get for List do not look like (and ...)
but instead directly specify the reposet atom.
This commit adds support for it. Additionally to help with our manual
testing we add support for "repo:" query in the selectRepoSet
optimization.
While pairing on this we decided to improve the debug output to tracing
and rename the misnamed query.Q variable in List from r to q.
Test Plan: updated the test case to directly test the sort of queries we
get in List calls. Additionally ran "zoekt-webserver -pprof" and
observed in net/trace output that selectRepoSet optimization reduced the
number of searched shards.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency (a large buffered channel) which shows up in
the profiler as 30% of CPU spent on chan_ related operations under List.
This commit follows how Search used to respect proc.Yield. See sched.go
in 90ed7bfd07e0a3137e3ee627cbf6824446df9c4d. We did not copy Search
since it uses a more complicated implementation than we need since it
supports streaming, while List is still batch only.
We needed to use errgroup to ensure we drained all channels in the case
of an error. Previously we did not need to do this since the channels
had a buffer size of len(shards), which gaurenteed nothing would ever
block. Now channels are never larger than the number of workers (<=
GOMAXPROCS).
Test Plan: go test covers the no error cases. In the case of errors we
manually tested by running zoekt-webserver and adding a random context
cancellation. We observed the error being reported and no List
goroutines running.
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
Currently we have unindexed search asking Zoekt what commit it has for a
very specific repository. In normal search we use selectRepoSet to avoid
searching shards which are unrelated to the query. This uses the same
optimization in List.
Test Plan: go test
There was no test coverage, so I added some.
Closes #683.
In go 1.22 the way go converts a query like .* into a String changed.
This meant that when we compiled zoekt in the sourcegraph codebase our
"regex matches all" optimization stopped working.
This commit moves us to a simple inspection of the syntax tree which
will be stable across versions and to be honest just feels better.
To do this we introduce a field "origRegexp" to the regexpMatchTree so
we can avoid needing to recompile the syntax tree. Note: Regexp does not
offer a way to get at its internal syntax.Regexp which would be nice.
Test Plan: go test and added a unit test
Our CI pipeline uses apline:edge which now ship with 1.22. I noticed
that matchree_test.go doesn't pass anymore. There seem to have been
changes in regex package which we catch in our tests.
test plan:
go test ./... and CI
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
Previous commit used go1.22.0 but in the sourcegraph repo we are using
1.22.1.
We no longer serve these old HTTP APIs starting with Sourcegraph 5.3, so this code is no longer required.
I hope this makes your lives a little easier by not having to maintain two code paths.
If this change is annoying (because you cannot backport fixes to older versions as easily), feel free to either keep it on hold or close it.
## Test plan
CI still passes, anything else I should be trying?
With this update, we score matches which overlap with a symbol in any way. Before, we only added a score if the match was fully contained in a symbol.
Test plan:
- new unit tests
- updated scoring tests
Both of these fields have been unread and mark deprecated for a while.
Test Plan: go test
Currently, we truncate a file's contents to 2048 bytes before passing it to
`go-enry`. I ran into a few cases where this is causing us to misclassify
files.
This PR removes the truncation. It should still be fine in terms of
performance, since `go-enry` is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.
gofumpt is a stricter gofmt. I took a look at the changes and in general
they are nice. I don't think we need to enforce the use of gofumpt, but
I like the idea of running it every once in a while.
Test Plan: go test ./...
This updates two of our dependencies to resolve GHSA-9763-4f94-gfch and
CVE-2023-47108.
Initially I ran
go get \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@0.46.0 \
github.com/cloudflare/circl@1.3.7
But then "go mod tidy" failed since otel likes to refactor and rename
things for not much benefit to end users. I tried a few different things
to get it to work, but finally just updated all otel deps to the latest
version:
go get -u go.opentelemetry.io/otel/...
Test Plan: unit tests, "go mod tidy" is clean and "trivy fs go.mod"
reports no vulns.
Ran "go get google.golang.org/grpc@v1.56.3". If I just use the latest
grpc it causes OTEL to update, which has a whole cascade of updates. I'd
rather tackle that in a seperate commit since last time I did that it
caused a bunch of pain in the sourcegraph repo.
Test Plan: go test
Noticed we weren't using this yet and that the API signatures had
changed.
Test Plan: go test
This fixes an annoying bug it had for Nix users on Darwin (ie me).
Test Plan: zoekt-webserver no longer log spams on startup.
Previously, shards crashed for queries like "foo type:file" if foo was
not present.
Test plan:
updated e2e test
This is just the outcome of running gofmt with the simplify option.
Test Plan: go test
This adds the new Boost query atom to the protobuf defintion.
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
The old behavior led to panics in NextDoc if the child was pruned.
This commit introduces a new primitive Boost to our query language. It
allows boosting (or dampening) the contribution to the score a query
atoms will match contribute.
To achieve this we introduce boostMatchTree which records this weight.
We then adjust the visitMatches to take an initial score weight (1.0),
and then each time we recurse through a boostMatchTree the score weight
is multiplied by the boost weight. Additionally candidateMatch now has a
new field, scoreWeight, which records the weight at time of candidate
collection. Without boosting in the query this value will always be 1.
Finally when scoring a candidateMatch we take the final score for it and
multiply it by scoreWeight.
Note: we do not expose a way to set this in the query language, only the
query API.
Test Plan: Manual testing against webserver via the new phrase-boost URL
param. Additionally updated ranking tests to use the phrase booster.
Many a time I have seen a slow search without tracing turned on. I then
go to visit the net/trace page to see what happened but it is missing
the stats to try work out why.
This commit will ensure we always log the aggregated statistics. This
should be cheap to do given Add is relatively fast and LazyPrintf will
only do the stringer operation if someone loads the debug page. It does
mean a Stats object lives a bit longer in memory, but it is small.
Test Plan: go test
This introduces an experiment where we can stop looking up ngrams at a
certain limit. The insight here is that for large substrings we spend
more time finding the smallest ngram frequency than the time a normal
search takes. So instead we can try and find a good balance between
looking for a good (two) ngrams and actually searching the corpus.
The plan is to set different values for
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in sourcegraph production and
see how it affects performance of attribution search service.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.
Here we calulate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.
Notes:
Also fixed an off-by-1 bug with how we count documents.
Test plan:
- Updated unit test
- Context evaluation results are slightly worse with a decrease from 64/89 to 63/89
We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.
The current configuration creates a conflict with default
fetch config `+refs/heads/*:refs/heads/*`.
The first fetch is ok but once a local "refs/heads/meta/config" exists
fetch fails because 2 fetch rules match for local "refs/heads/meta/config" :
"refs/meta/config" and "refs/heads/meta/config"
Changing local ref name fixes the problem
The images produced are no longer consumed directly by sourcegraph so we
don't have security scanners running on them. Additionally we upgraded
to the latest alpine now. Both combined means we can simplify the apk
add lines.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
gitindex: include environ for git tests
A recent change introduced GIT_CONFIG_GLOBAL and GIT_CONFIG_SYSTEM, but
ran into the footgun of unsetting the rest of the environ when cmd.Env
is non-empty. This for example broke tests for me since we didn't have
my custom $PATH set.
Instead we revert the change and make a much smaller change to just
always set the relevant environment variables.
This reverts commit ab1b8f09199ea7a731fd80876fc5b0f376ff6882.
Test Plan: go test
The updates our "line model" to fix the edge cases that led to sourcegraph/sourcegraph#60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).
Before this, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).
The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.
So, instead, this redefines a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.
BM25 considers a file to be a better match when there are many occurrences of
terms in the file. It's important to count all term occurrences, including
those in other fields like the filename.
For historical reasons, Zoekt trims all filename matches from a result if there
are any content matches. This meant that in BM25 scoring, we didn't account for
filename matches.
This PR refactors the match code so that we only trim filename matches when
assembling the final `FileMatch`. We retain filename matches when creating
`candidateMatch`, which lets BM25 scoring use them. Even without the better
BM25 scoring, I think this refactor makes the code easier to follow.
As part of our context quality work, I plan to update BM25 scoring so it takes
into account both filename and content matches. (Right now we ignore all
filename matches if there are any content matches.)
This is a preliminary refactor in the scoring code:
* Pull scoring logic into its own file `score.go`, since `eval.go` is super
long
* Make `indexData.Search` a bit shorter and add comments
We replace all calls to Regexp.String with a vendored version which is
faster.
go1.22 introduced a commit which "minimizes" the string returned by
Regexp.String(). Part of what it does is run enumerate through literals
runes in your string to see calculate flags related to unicode and case
sensitivity. This can be quite slow, but is made worse by the fact we
call it per shard per regexp in your query.Q to construct the matchtree.
Currently Regexp.String() represents 40% of CPU time on sourcegraph.com.
Before go1.22 it was ~0%.
Note: This is a temporary change to resolve the issue. I have a deeper
change to make this less clumsy.
Note: In one place we remove the use of string by relying on
Regexp.Equal instead.
Test Plan: go test
This reverts commit 931974dd953e07ce8fbe28e3b615323c2d156e40.
After monitoring production we are seeing increased latencies in List
calls. Additionally we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now until further investigation.
Test Plan: go test
The previous commit which added support for selectRepoSet in List was
ineffective since the queries we get for List do not look like (and ...)
but instead directly specify the reposet atom.
This commit adds support for it. Additionally to help with our manual
testing we add support for "repo:" query in the selectRepoSet
optimization.
While pairing on this we decided to improve the debug output to tracing
and rename the misnamed query.Q variable in List from r to q.
Test Plan: updated the test case to directly test the sort of queries we
get in List calls. Additionally ran "zoekt-webserver -pprof" and
observed in net/trace output that selectRepoSet optimization reduced the
number of searched shards.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency (a large buffered channel) which shows up in
the profiler as 30% of CPU spent on chan_ related operations under List.
This commit follows how Search used to respect proc.Yield. See sched.go
in 90ed7bfd07e0a3137e3ee627cbf6824446df9c4d. We did not copy Search
since it uses a more complicated implementation than we need since it
supports streaming, while List is still batch only.
We needed to use errgroup to ensure we drained all channels in the case
of an error. Previously we did not need to do this since the channels
had a buffer size of len(shards), which gaurenteed nothing would ever
block. Now channels are never larger than the number of workers (<=
GOMAXPROCS).
Test Plan: go test covers the no error cases. In the case of errors we
manually tested by running zoekt-webserver and adding a random context
cancellation. We observed the error being reported and no List
goroutines running.
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
In go 1.22 the way go converts a query like .* into a String changed.
This meant that when we compiled zoekt in the sourcegraph codebase our
"regex matches all" optimization stopped working.
This commit moves us to a simple inspection of the syntax tree which
will be stable across versions and to be honest just feels better.
To do this we introduce a field "origRegexp" to the regexpMatchTree so
we can avoid needing to recompile the syntax tree. Note: Regexp does not
offer a way to get at its internal syntax.Regexp which would be nice.
Test Plan: go test and added a unit test
We no longer serve these old HTTP APIs starting with Sourcegraph 5.3, so this code is no longer required.
I hope this makes your lives a little easier by not having to maintain two code paths.
If this change is annoying (because you cannot backport fixes to older versions as easily), feel free to either keep it on hold or close it.
## Test plan
CI still passes, anything else I should be trying?
Currently, we truncate a file's contents to 2048 bytes before passing it to
`go-enry`. I ran into a few cases where this is causing us to misclassify
files.
This PR removes the truncation. It should still be fine in terms of
performance, since `go-enry` is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.
This updates two of our dependencies to resolve GHSA-9763-4f94-gfch and
CVE-2023-47108.
Initially I ran
go get \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@0.46.0 \
github.com/cloudflare/circl@1.3.7
But then "go mod tidy" failed since otel likes to refactor and rename
things for not much benefit to end users. I tried a few different things
to get it to work, but finally just updated all otel deps to the latest
version:
go get -u go.opentelemetry.io/otel/...
Test Plan: unit tests, "go mod tidy" is clean and "trivy fs go.mod"
reports no vulns.
This commit introduces a new primitive Boost to our query language. It
allows boosting (or dampening) the contribution to the score a query
atoms will match contribute.
To achieve this we introduce boostMatchTree which records this weight.
We then adjust the visitMatches to take an initial score weight (1.0),
and then each time we recurse through a boostMatchTree the score weight
is multiplied by the boost weight. Additionally candidateMatch now has a
new field, scoreWeight, which records the weight at time of candidate
collection. Without boosting in the query this value will always be 1.
Finally when scoring a candidateMatch we take the final score for it and
multiply it by scoreWeight.
Note: we do not expose a way to set this in the query language, only the
query API.
Test Plan: Manual testing against webserver via the new phrase-boost URL
param. Additionally updated ranking tests to use the phrase booster.
Many a time I have seen a slow search without tracing turned on. I then
go to visit the net/trace page to see what happened but it is missing
the stats to try work out why.
This commit will ensure we always log the aggregated statistics. This
should be cheap to do given Add is relatively fast and LazyPrintf will
only do the stringer operation if someone loads the debug page. It does
mean a Stats object lives a bit longer in memory, but it is small.
Test Plan: go test