Commits
While working on ranking, I noticed that sum-tf is wrong if we have filename and content matches.
We use `finalCands` in our BM25 scoring, however, `finalCands` is modified in `fillChunkMatches` and `fillMatches` which can lead to surprising scores.
Test plan:
updated unit test
* Honor regex flags for case-sensitivity
* change
* add test
* pr comment
The repo is indexed only with the reponame which is more user-friendly
in the search when there is onely one host
Filter string will be "r:my-repo" instead of "r:my-gerrit-sever.com/my-repo"
This allows indexing repos that are used for development and therefore have an
SSH 'origin' URL.
After this update "trivy fs go.mod" has a clean report.
go get golang.org/x/net@v0.23.0 google.golang.org/protobuf@v1.33.0
Test Plan: go test
The images produced are no longer consumed directly by sourcegraph so we
don't have security scanners running on them. Additionally we upgraded
to the latest alpine now. Both combined means we can simplify the apk
add lines.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
gitindex: include environ for git tests
A recent change introduced GIT_CONFIG_GLOBAL and GIT_CONFIG_SYSTEM, but
ran into the footgun of unsetting the rest of the environ when cmd.Env
is non-empty. This for example broke tests for me since we didn't have
my custom $PATH set.
Instead we revert the change and make a much smaller change to just
always set the relevant environment variables.
This reverts commit ab1b8f09199ea7a731fd80876fc5b0f376ff6882.
Test Plan: go test
We now only consume zoekt via gRPC at Sourcegraph and I doubt anyone
uses the old endpoints.
This will have one required update in sourcegraph, and that is to use
SenderFunc from the main zoekt package rather than from the now deleted
stream package.
go.mod requires go1.22. This should fix the build step on main.
Test Plan: monitor CI
The updates our "line model" to fix the edge cases that led to sourcegraph/sourcegraph#60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).
Before this, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).
The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.
So, instead, this redefines a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.
BM25 considers a file to be a better match when there are many occurrences of
terms in the file. It's important to count all term occurrences, including
those in other fields like the filename.
For historical reasons, Zoekt trims all filename matches from a result if there
are any content matches. This meant that in BM25 scoring, we didn't account for
filename matches.
This PR refactors the match code so that we only trim filename matches when
assembling the final `FileMatch`. We retain filename matches when creating
`candidateMatch`, which lets BM25 scoring use them. Even without the better
BM25 scoring, I think this refactor makes the code easier to follow.
As part of our context quality work, I plan to update BM25 scoring so it takes
into account both filename and content matches. (Right now we ignore all
filename matches if there are any content matches.)
This is a preliminary refactor in the scoring code:
* Pull scoring logic into its own file `score.go`, since `eval.go` is super
long
* Make `indexData.Search` a bit shorter and add comments
We replace all calls to Regexp.String with a vendored version which is
faster.
go1.22 introduced a commit which "minimizes" the string returned by
Regexp.String(). Part of what it does is run enumerate through literals
runes in your string to see calculate flags related to unicode and case
sensitivity. This can be quite slow, but is made worse by the fact we
call it per shard per regexp in your query.Q to construct the matchtree.
Currently Regexp.String() represents 40% of CPU time on sourcegraph.com.
Before go1.22 it was ~0%.
Note: This is a temporary change to resolve the issue. I have a deeper
change to make this less clumsy.
Note: In one place we remove the use of string by relying on
Regexp.Equal instead.
Test Plan: go test
This reverts commit 931974dd953e07ce8fbe28e3b615323c2d156e40.
After monitoring production we are seeing increased latencies in List
calls. Additionally we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now until further investigation.
Test Plan: go test
The previous commit which added support for selectRepoSet in List was
ineffective since the queries we get for List do not look like (and ...)
but instead directly specify the reposet atom.
This commit adds support for it. Additionally to help with our manual
testing we add support for "repo:" query in the selectRepoSet
optimization.
While pairing on this we decided to improve the debug output to tracing
and rename the misnamed query.Q variable in List from r to q.
Test Plan: updated the test case to directly test the sort of queries we
get in List calls. Additionally ran "zoekt-webserver -pprof" and
observed in net/trace output that selectRepoSet optimization reduced the
number of searched shards.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency (a large buffered channel) which shows up in
the profiler as 30% of CPU spent on chan_ related operations under List.
This commit follows how Search used to respect proc.Yield. See sched.go
in 90ed7bfd07e0a3137e3ee627cbf6824446df9c4d. We did not copy Search
since it uses a more complicated implementation than we need since it
supports streaming, while List is still batch only.
We needed to use errgroup to ensure we drained all channels in the case
of an error. Previously we did not need to do this since the channels
had a buffer size of len(shards), which gaurenteed nothing would ever
block. Now channels are never larger than the number of workers (<=
GOMAXPROCS).
Test Plan: go test covers the no error cases. In the case of errors we
manually tested by running zoekt-webserver and adding a random context
cancellation. We observed the error being reported and no List
goroutines running.
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
Currently we have unindexed search asking Zoekt what commit it has for a
very specific repository. In normal search we use selectRepoSet to avoid
searching shards which are unrelated to the query. This uses the same
optimization in List.
Test Plan: go test
There was no test coverage, so I added some.
Closes #683.
In go 1.22 the way go converts a query like .* into a String changed.
This meant that when we compiled zoekt in the sourcegraph codebase our
"regex matches all" optimization stopped working.
This commit moves us to a simple inspection of the syntax tree which
will be stable across versions and to be honest just feels better.
To do this we introduce a field "origRegexp" to the regexpMatchTree so
we can avoid needing to recompile the syntax tree. Note: Regexp does not
offer a way to get at its internal syntax.Regexp which would be nice.
Test Plan: go test and added a unit test
Our CI pipeline uses apline:edge which now ship with 1.22. I noticed
that matchree_test.go doesn't pass anymore. There seem to have been
changes in regex package which we catch in our tests.
test plan:
go test ./... and CI
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
Previous commit used go1.22.0 but in the sourcegraph repo we are using
1.22.1.
We no longer serve these old HTTP APIs starting with Sourcegraph 5.3, so this code is no longer required.
I hope this makes your lives a little easier by not having to maintain two code paths.
If this change is annoying (because you cannot backport fixes to older versions as easily), feel free to either keep it on hold or close it.
## Test plan
CI still passes, anything else I should be trying?
With this update, we score matches which overlap with a symbol in any way. Before, we only added a score if the match was fully contained in a symbol.
Test plan:
- new unit tests
- updated scoring tests
Both of these fields have been unread and mark deprecated for a while.
Test Plan: go test
Currently, we truncate a file's contents to 2048 bytes before passing it to
`go-enry`. I ran into a few cases where this is causing us to misclassify
files.
This PR removes the truncation. It should still be fine in terms of
performance, since `go-enry` is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.
gofumpt is a stricter gofmt. I took a look at the changes and in general
they are nice. I don't think we need to enforce the use of gofumpt, but
I like the idea of running it every once in a while.
Test Plan: go test ./...
This updates two of our dependencies to resolve GHSA-9763-4f94-gfch and
CVE-2023-47108.
Initially I ran
go get \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@0.46.0 \
github.com/cloudflare/circl@1.3.7
But then "go mod tidy" failed since otel likes to refactor and rename
things for not much benefit to end users. I tried a few different things
to get it to work, but finally just updated all otel deps to the latest
version:
go get -u go.opentelemetry.io/otel/...
Test Plan: unit tests, "go mod tidy" is clean and "trivy fs go.mod"
reports no vulns.
Ran "go get google.golang.org/grpc@v1.56.3". If I just use the latest
grpc it causes OTEL to update, which has a whole cascade of updates. I'd
rather tackle that in a seperate commit since last time I did that it
caused a bunch of pain in the sourcegraph repo.
Test Plan: go test
Noticed we weren't using this yet and that the API signatures had
changed.
Test Plan: go test
This fixes an annoying bug it had for Nix users on Darwin (ie me).
Test Plan: zoekt-webserver no longer log spams on startup.
Previously, shards crashed for queries like "foo type:file" if foo was
not present.
Test plan:
updated e2e test
This is just the outcome of running gofmt with the simplify option.
Test Plan: go test
This adds the new Boost query atom to the protobuf defintion.
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
The old behavior led to panics in NextDoc if the child was pruned.
This commit introduces a new primitive Boost to our query language. It
allows boosting (or dampening) the contribution to the score a query
atoms will match contribute.
To achieve this we introduce boostMatchTree which records this weight.
We then adjust the visitMatches to take an initial score weight (1.0),
and then each time we recurse through a boostMatchTree the score weight
is multiplied by the boost weight. Additionally candidateMatch now has a
new field, scoreWeight, which records the weight at time of candidate
collection. Without boosting in the query this value will always be 1.
Finally when scoring a candidateMatch we take the final score for it and
multiply it by scoreWeight.
Note: we do not expose a way to set this in the query language, only the
query API.
Test Plan: Manual testing against webserver via the new phrase-boost URL
param. Additionally updated ranking tests to use the phrase booster.
Many a time I have seen a slow search without tracing turned on. I then
go to visit the net/trace page to see what happened but it is missing
the stats to try work out why.
This commit will ensure we always log the aggregated statistics. This
should be cheap to do given Add is relatively fast and LazyPrintf will
only do the stringer operation if someone loads the debug page. It does
mean a Stats object lives a bit longer in memory, but it is small.
Test Plan: go test
When thinking about transforming queries like 'foo bar' into '(foo bar)
or "foo bar"' we would want to keep the phrase candidateMatch and not
throw it away in gatherMatches. By sorting longer matches before others
that start at the same offset we end up keeping those.
Note: this only affects ChunkMatch, since for LineMatch we merge when we
find overlaps.
Test Plan: This was quite hard to test with our existing e2e tests due
to them not recording offsets, only matching lines. So instead I am just
relying on the fact we didn't break anything and once we add proper
support for phrases we will have a test then.
My last PR missed this. Now that we only score by candidateMatch we can
remove the matchScore implementation which was used to score LineMatch.
Test Plan: go test
This is a refactor which removes our duplicated scoring logic for
ChunkMatch vs LineMatch. Instead we score slices of []*candidateMatch.
Other than being a good refactor, candidateMatch is a much more
appropriate structure to stuff in extra information for scoring than our
public APIs. So this enables the work we want to do around atom based
scoring.
The only behaviour change in this commit are two fixes:
- DocumentSection caching would fail if empty since we relied on the
empty cache to be non-nil. This lead to inflated ContentBytesLoaded.
- Empty FileMatch would still read in DocumentSection, even though it
wasn't needed.
Test Plan: go test. Our coverage is decent, we have lots of ranking
tests which did not change and we have hardcoded stats of work done
which did not change (except for the above fixes).
There are several fields which are optional in FileMatch. For example
branch and version are only set if indexing a git repo (zoekt can index
arbitrary files). We mark these fields as optional to marshal in JSON.
The main benefit of this is improving the readability of some golden
files.
Note: I noticed this when an unrelated commit touched these golden
files.
Test Plan: go test
This example was given in our channel recently as a good result, so lets
keep track of it to ensure we don't regress.
Test Plan: go test
This has been running for three months and we haven't found a need to
disable it.
Test Plan: go test ./...
I am often reading the output of String in traces and logs, and it is
really hard to parse since there are many fields and most are unset.
This is a quality of life improvement so it is much easier to scan the
output.
For example the default zoekt-webserver struct's string output goes from
a 456 byte string to
zoekt.SearchOptions{ ShardMaxMatchCount=100000 TotalMaxMatchCount=1000000 MaxWallTime=10s }
Test Plan: go test. The unit tests ensure I cover every field now and in
the future when fields are added.
This adds four cases which are exact phrases to search for.
"assets are not configured for this binary" finds the correct document,
but it isn't shown in the summary. Our target rank doesn't capture that
this could be better, but I will use the golden file to see if I improve
this.
"sourcegraph/server docker image build" is an example of an exact phrase
which also happens to contain words of highly ranked symbols. This leads
to the exact phrase getting buried. I want to see if I can boost that.
"bufio flush writer" should find the symbol bufioFlushWriter.
"coverage data writer" should find the symbol CoverageDataWriter.
Test Plan: go test
Structs related to matches can occur a lot in memory. As such there is
some value to ensuring the order of the fields is aligned to avoid
unneccessary padding.
The "fieldalignment" tool was used to find these changes.
Test Plan: go test
This adds a new field to the golden files "targetRank" which records the
rank of the document we are looking for. Additionally the document is
marked with "**" in the golden files.
Additionally we add a new golden file which contains recall@1, recall@5
and the MRR.
I set the target documents by looking at the existing results and
guessing which was the one we wanted based on memory. In some cases we
no longer had the top document, for example for generate unit test.
Test Plan: go test
While iterating on the ranking tests it is frustrating to wait for the
index to be built. This commit adds a "-shard_cache" flag to reuse the
computed shards. It defaults to off since this could lead to unexpected
results if we change how we index.
Test Plan: run go test several times with the flag, only the first run
should be slow. Then confirm without flag it is "slow".
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.307s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.235s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.218s
$ go test
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.416s
We had ranking e2e tests living in the zoekt-archive-index cmd for
convenience since that contained useful functions for indexing a remote
tarball from the GitHub API. This commit splits the archive
functionality into a new internal/archive package and the ranking tests
into a new internal/e2e package.
The zoekt-archive-index code is now quite minimal. This is similiar to
how zoekt-git-index mostly just calls out to the gitindex package. What
is different is that archive package is marked internal, unlike
gitindex. gitindex should also be internal, but the code predates go's
support for internal.
I suspect more of our e2e tests will end up in this package.
Test Plan: go test ./...
This change uses the fact that candidate matches should be increasing in byte
offset, to avoid recounting runes on a line. Before this change if you have
many matches on the same line we would call `utf8.RuneCount` for each match,
which is a `O(nm)` algorithm where `n` is your line length and `m` is the
number of matches. After this change the complexity is `O(n)`.
I came across this while investigating slow performance for searching the
string "dev" on s2 taking 2s if the match limits where 100k instead of 10k.
With 10k it would take 0.04s. It turns out with the larger limit we ended up
searching a file were the word dev appeared many times on one line. Running a
profiler against the service came up with 96% of CPU time in `utf8.RuneCount`.
This commit adds a benchmark for the helper introduced to reuse RuneCounts.
Unsurprisingly the difference is massive between `O(nm)` and `O(n)` :)
name old time/op new time/op delta
ColumnHelper-32 299ms ± 2% 0ms ± 2% -99.97% (p=0.000 n=10+10)
Test Plan: Added tests and benchmarks.
I'd like to start taking advantage of the NumContextLines option, but have been running into an issue where, if context lines were to extend past the end of the file, we get the trailing newline at the end of the file.
This is not desirable because the empty slice after a trailing newline is not treated as a "line" by any editor, so when we split on newlines, an empty line is shown to the user. By the time we're splitting on newlines, we do not know whether or not we're at the end of the file, so we cannot know whether we can trim that trailing newline, or whether it is, correctly, an empty line that should be rendered.
This is really annoying stuff that bites us elsewhere as well (searcher, syntax highlighter). It might be nice to unify the definitions of "what is a line" in some library, but for now, I'll be happy with "don't return an extra line."
The images produced are no longer consumed directly by sourcegraph so we
don't have security scanners running on them. Additionally we upgraded
to the latest alpine now. Both combined means we can simplify the apk
add lines.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
gitindex: include environ for git tests
A recent change introduced GIT_CONFIG_GLOBAL and GIT_CONFIG_SYSTEM, but
ran into the footgun of unsetting the rest of the environ when cmd.Env
is non-empty. This for example broke tests for me since we didn't have
my custom $PATH set.
Instead we revert the change and make a much smaller change to just
always set the relevant environment variables.
This reverts commit ab1b8f09199ea7a731fd80876fc5b0f376ff6882.
Test Plan: go test
The updates our "line model" to fix the edge cases that led to sourcegraph/sourcegraph#60605. In short, this changes the definition of a "line" to include its terminating newline (if it exists).
Before this, we had defined a "line" as starting at the byte after a newline (or the beginning of a file) and ending at the byte before a newline (or the end of the file).
The problem with that definition is that a newline that is the last byte in the file can never successfully be matched because we would trim that from the returned content, so any ranges that would match that trailing newline would be out of bounds in the result returned to the client. That's the reason behind the panics caused by #709, which was an attempt to formalize the "line does not include a trailing newline" definition.
So, instead, this redefines a line as ending at the byte after a newline (or the end of the file). This means that a regex can successfully and safely match a terminating newline.
BM25 considers a file to be a better match when there are many occurrences of
terms in the file. It's important to count all term occurrences, including
those in other fields like the filename.
For historical reasons, Zoekt trims all filename matches from a result if there
are any content matches. This meant that in BM25 scoring, we didn't account for
filename matches.
This PR refactors the match code so that we only trim filename matches when
assembling the final `FileMatch`. We retain filename matches when creating
`candidateMatch`, which lets BM25 scoring use them. Even without the better
BM25 scoring, I think this refactor makes the code easier to follow.
As part of our context quality work, I plan to update BM25 scoring so it takes
into account both filename and content matches. (Right now we ignore all
filename matches if there are any content matches.)
This is a preliminary refactor in the scoring code:
* Pull scoring logic into its own file `score.go`, since `eval.go` is super
long
* Make `indexData.Search` a bit shorter and add comments
We replace all calls to Regexp.String with a vendored version which is
faster.
go1.22 introduced a commit which "minimizes" the string returned by
Regexp.String(). Part of what it does is run enumerate through literals
runes in your string to see calculate flags related to unicode and case
sensitivity. This can be quite slow, but is made worse by the fact we
call it per shard per regexp in your query.Q to construct the matchtree.
Currently Regexp.String() represents 40% of CPU time on sourcegraph.com.
Before go1.22 it was ~0%.
Note: This is a temporary change to resolve the issue. I have a deeper
change to make this less clumsy.
Note: In one place we remove the use of string by relying on
Regexp.Equal instead.
Test Plan: go test
This reverts commit 931974dd953e07ce8fbe28e3b615323c2d156e40.
After monitoring production we are seeing increased latencies in List
calls. Additionally we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now until further investigation.
Test Plan: go test
The previous commit which added support for selectRepoSet in List was
ineffective since the queries we get for List do not look like (and ...)
but instead directly specify the reposet atom.
This commit adds support for it. Additionally to help with our manual
testing we add support for "repo:" query in the selectRepoSet
optimization.
While pairing on this we decided to improve the debug output to tracing
and rename the misnamed query.Q variable in List from r to q.
Test Plan: updated the test case to directly test the sort of queries we
get in List calls. Additionally ran "zoekt-webserver -pprof" and
observed in net/trace output that selectRepoSet optimization reduced the
number of searched shards.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency (a large buffered channel) which shows up in
the profiler as 30% of CPU spent on chan_ related operations under List.
This commit follows how Search used to respect proc.Yield. See sched.go
in 90ed7bfd07e0a3137e3ee627cbf6824446df9c4d. We did not copy Search
since it uses a more complicated implementation than we need since it
supports streaming, while List is still batch only.
We needed to use errgroup to ensure we drained all channels in the case
of an error. Previously we did not need to do this since the channels
had a buffer size of len(shards), which gaurenteed nothing would ever
block. Now channels are never larger than the number of workers (<=
GOMAXPROCS).
Test Plan: go test covers the no error cases. In the case of errors we
manually tested by running zoekt-webserver and adding a random context
cancellation. We observed the error being reported and no List
goroutines running.
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
In go 1.22 the way go converts a query like .* into a String changed.
This meant that when we compiled zoekt in the sourcegraph codebase our
"regex matches all" optimization stopped working.
This commit moves us to a simple inspection of the syntax tree which
will be stable across versions and to be honest just feels better.
To do this we introduce a field "origRegexp" to the regexpMatchTree so
we can avoid needing to recompile the syntax tree. Note: Regexp does not
offer a way to get at its internal syntax.Regexp which would be nice.
Test Plan: go test and added a unit test
We no longer serve these old HTTP APIs starting with Sourcegraph 5.3, so this code is no longer required.
I hope this makes your lives a little easier by not having to maintain two code paths.
If this change is annoying (because you cannot backport fixes to older versions as easily), feel free to either keep it on hold or close it.
## Test plan
CI still passes, anything else I should be trying?
Currently, we truncate a file's contents to 2048 bytes before passing it to
`go-enry`. I ran into a few cases where this is causing us to misclassify
files.
This PR removes the truncation. It should still be fine in terms of
performance, since `go-enry` is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.
This updates two of our dependencies to resolve GHSA-9763-4f94-gfch and
CVE-2023-47108.
Initially I ran
go get \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@0.46.0 \
github.com/cloudflare/circl@1.3.7
But then "go mod tidy" failed since otel likes to refactor and rename
things for not much benefit to end users. I tried a few different things
to get it to work, but finally just updated all otel deps to the latest
version:
go get -u go.opentelemetry.io/otel/...
Test Plan: unit tests, "go mod tidy" is clean and "trivy fs go.mod"
reports no vulns.
This commit introduces a new primitive Boost to our query language. It
allows boosting (or dampening) the contribution to the score a query
atoms will match contribute.
To achieve this we introduce boostMatchTree which records this weight.
We then adjust the visitMatches to take an initial score weight (1.0),
and then each time we recurse through a boostMatchTree the score weight
is multiplied by the boost weight. Additionally candidateMatch now has a
new field, scoreWeight, which records the weight at time of candidate
collection. Without boosting in the query this value will always be 1.
Finally when scoring a candidateMatch we take the final score for it and
multiply it by scoreWeight.
Note: we do not expose a way to set this in the query language, only the
query API.
Test Plan: Manual testing against webserver via the new phrase-boost URL
param. Additionally updated ranking tests to use the phrase booster.
Many a time I have seen a slow search without tracing turned on. I then
go to visit the net/trace page to see what happened but it is missing
the stats to try work out why.
This commit will ensure we always log the aggregated statistics. This
should be cheap to do given Add is relatively fast and LazyPrintf will
only do the stringer operation if someone loads the debug page. It does
mean a Stats object lives a bit longer in memory, but it is small.
Test Plan: go test
When thinking about transforming queries like 'foo bar' into '(foo bar)
or "foo bar"' we would want to keep the phrase candidateMatch and not
throw it away in gatherMatches. By sorting longer matches before others
that start at the same offset we end up keeping those.
Note: this only affects ChunkMatch, since for LineMatch we merge when we
find overlaps.
Test Plan: This was quite hard to test with our existing e2e tests due
to them not recording offsets, only matching lines. So instead I am just
relying on the fact we didn't break anything and once we add proper
support for phrases we will have a test then.
This is a refactor which removes our duplicated scoring logic for
ChunkMatch vs LineMatch. Instead we score slices of []*candidateMatch.
Other than being a good refactor, candidateMatch is a much more
appropriate structure to stuff in extra information for scoring than our
public APIs. So this enables the work we want to do around atom based
scoring.
The only behaviour change in this commit are two fixes:
- DocumentSection caching would fail if empty since we relied on the
empty cache to be non-nil. This lead to inflated ContentBytesLoaded.
- Empty FileMatch would still read in DocumentSection, even though it
wasn't needed.
Test Plan: go test. Our coverage is decent, we have lots of ranking
tests which did not change and we have hardcoded stats of work done
which did not change (except for the above fixes).
There are several fields which are optional in FileMatch. For example
branch and version are only set if indexing a git repo (zoekt can index
arbitrary files). We mark these fields as optional to marshal in JSON.
The main benefit of this is improving the readability of some golden
files.
Note: I noticed this when an unrelated commit touched these golden
files.
Test Plan: go test
I am often reading the output of String in traces and logs, and it is
really hard to parse since there are many fields and most are unset.
This is a quality of life improvement so it is much easier to scan the
output.
For example the default zoekt-webserver struct's string output goes from
a 456 byte string to
zoekt.SearchOptions{ ShardMaxMatchCount=100000 TotalMaxMatchCount=1000000 MaxWallTime=10s }
Test Plan: go test. The unit tests ensure I cover every field now and in
the future when fields are added.
This adds four cases which are exact phrases to search for.
"assets are not configured for this binary" finds the correct document,
but it isn't shown in the summary. Our target rank doesn't capture that
this could be better, but I will use the golden file to see if I improve
this.
"sourcegraph/server docker image build" is an example of an exact phrase
which also happens to contain words of highly ranked symbols. This leads
to the exact phrase getting buried. I want to see if I can boost that.
"bufio flush writer" should find the symbol bufioFlushWriter.
"coverage data writer" should find the symbol CoverageDataWriter.
Test Plan: go test
This adds a new field to the golden files "targetRank" which records the
rank of the document we are looking for. Additionally the document is
marked with "**" in the golden files.
Additionally we add a new golden file which contains recall@1, recall@5
and the MRR.
I set the target documents by looking at the existing results and
guessing which was the one we wanted based on memory. In some cases we
no longer had the top document, for example for generate unit test.
Test Plan: go test
While iterating on the ranking tests it is frustrating to wait for the
index to be built. This commit adds a "-shard_cache" flag to reuse the
computed shards. It defaults to off since this could lead to unexpected
results if we change how we index.
Test Plan: run go test several times with the flag, only the first run
should be slow. Then confirm without flag it is "slow".
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.307s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.235s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.218s
$ go test
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.416s
We had ranking e2e tests living in the zoekt-archive-index cmd for
convenience since that contained useful functions for indexing a remote
tarball from the GitHub API. This commit splits the archive
functionality into a new internal/archive package and the ranking tests
into a new internal/e2e package.
The zoekt-archive-index code is now quite minimal. This is similiar to
how zoekt-git-index mostly just calls out to the gitindex package. What
is different is that archive package is marked internal, unlike
gitindex. gitindex should also be internal, but the code predates go's
support for internal.
I suspect more of our e2e tests will end up in this package.
Test Plan: go test ./...
This change uses the fact that candidate matches should be increasing in byte
offset, to avoid recounting runes on a line. Before this change if you have
many matches on the same line we would call `utf8.RuneCount` for each match,
which is a `O(nm)` algorithm where `n` is your line length and `m` is the
number of matches. After this change the complexity is `O(n)`.
I came across this while investigating slow performance for searching the
string "dev" on s2 taking 2s if the match limits where 100k instead of 10k.
With 10k it would take 0.04s. It turns out with the larger limit we ended up
searching a file were the word dev appeared many times on one line. Running a
profiler against the service came up with 96% of CPU time in `utf8.RuneCount`.
This commit adds a benchmark for the helper introduced to reuse RuneCounts.
Unsurprisingly the difference is massive between `O(nm)` and `O(n)` :)
name old time/op new time/op delta
ColumnHelper-32 299ms ± 2% 0ms ± 2% -99.97% (p=0.000 n=10+10)
Test Plan: Added tests and benchmarks.
I'd like to start taking advantage of the NumContextLines option, but have been running into an issue where, if context lines were to extend past the end of the file, we get the trailing newline at the end of the file.
This is not desirable because the empty slice after a trailing newline is not treated as a "line" by any editor, so when we split on newlines, an empty line is shown to the user. By the time we're splitting on newlines, we do not know whether or not we're at the end of the file, so we cannot know whether we can trim that trailing newline, or whether it is, correctly, an empty line that should be rendered.
This is really annoying stuff that bites us elsewhere as well (searcher, syntax highlighter). It might be nice to unify the definitions of "what is a line" in some library, but for now, I'll be happy with "don't return an extra line."