Commits
We replace all calls to Regexp.String with a vendored version which is
faster.
go1.22 introduced a commit which "minimizes" the string returned by
Regexp.String(). Part of what it does is run enumerate through literals
runes in your string to see calculate flags related to unicode and case
sensitivity. This can be quite slow, but is made worse by the fact we
call it per shard per regexp in your query.Q to construct the matchtree.
Currently Regexp.String() represents 40% of CPU time on sourcegraph.com.
Before go1.22 it was ~0%.
Note: This is a temporary change to resolve the issue. I have a deeper
change to make this less clumsy.
Note: In one place we remove the use of string by relying on
Regexp.Equal instead.
Test Plan: go test
This reverts commit 931974dd953e07ce8fbe28e3b615323c2d156e40.
After monitoring production we are seeing increased latencies in List
calls. Additionally we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now until further investigation.
Test Plan: go test
The previous commit which added support for selectRepoSet in List was
ineffective since the queries we get for List do not look like (and ...)
but instead directly specify the reposet atom.
This commit adds support for it. Additionally to help with our manual
testing we add support for "repo:" query in the selectRepoSet
optimization.
While pairing on this we decided to improve the debug output to tracing
and rename the misnamed query.Q variable in List from r to q.
Test Plan: updated the test case to directly test the sort of queries we
get in List calls. Additionally ran "zoekt-webserver -pprof" and
observed in net/trace output that selectRepoSet optimization reduced the
number of searched shards.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency (a large buffered channel) which shows up in
the profiler as 30% of CPU spent on chan_ related operations under List.
This commit follows how Search used to respect proc.Yield. See sched.go
in 90ed7bfd07e0a3137e3ee627cbf6824446df9c4d. We did not copy Search
since it uses a more complicated implementation than we need since it
supports streaming, while List is still batch only.
We needed to use errgroup to ensure we drained all channels in the case
of an error. Previously we did not need to do this since the channels
had a buffer size of len(shards), which gaurenteed nothing would ever
block. Now channels are never larger than the number of workers (<=
GOMAXPROCS).
Test Plan: go test covers the no error cases. In the case of errors we
manually tested by running zoekt-webserver and adding a random context
cancellation. We observed the error being reported and no List
goroutines running.
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
Currently we have unindexed search asking Zoekt what commit it has for a
very specific repository. In normal search we use selectRepoSet to avoid
searching shards which are unrelated to the query. This uses the same
optimization in List.
Test Plan: go test
There was no test coverage, so I added some.
Closes #683.
In go 1.22 the way go converts a query like .* into a String changed.
This meant that when we compiled zoekt in the sourcegraph codebase our
"regex matches all" optimization stopped working.
This commit moves us to a simple inspection of the syntax tree which
will be stable across versions and to be honest just feels better.
To do this we introduce a field "origRegexp" to the regexpMatchTree so
we can avoid needing to recompile the syntax tree. Note: Regexp does not
offer a way to get at its internal syntax.Regexp which would be nice.
Test Plan: go test and added a unit test
Our CI pipeline uses apline:edge which now ship with 1.22. I noticed
that matchree_test.go doesn't pass anymore. There seem to have been
changes in regex package which we catch in our tests.
test plan:
go test ./... and CI
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
Previous commit used go1.22.0 but in the sourcegraph repo we are using
1.22.1.
We no longer serve these old HTTP APIs starting with Sourcegraph 5.3, so this code is no longer required.
I hope this makes your lives a little easier by not having to maintain two code paths.
If this change is annoying (because you cannot backport fixes to older versions as easily), feel free to either keep it on hold or close it.
## Test plan
CI still passes, anything else I should be trying?
With this update, we score matches which overlap with a symbol in any way. Before, we only added a score if the match was fully contained in a symbol.
Test plan:
- new unit tests
- updated scoring tests
Both of these fields have been unread and mark deprecated for a while.
Test Plan: go test
Currently, we truncate a file's contents to 2048 bytes before passing it to
`go-enry`. I ran into a few cases where this is causing us to misclassify
files.
This PR removes the truncation. It should still be fine in terms of
performance, since `go-enry` is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.
gofumpt is a stricter gofmt. I took a look at the changes and in general
they are nice. I don't think we need to enforce the use of gofumpt, but
I like the idea of running it every once in a while.
Test Plan: go test ./...
This updates two of our dependencies to resolve GHSA-9763-4f94-gfch and
CVE-2023-47108.
Initially I ran
go get \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@0.46.0 \
github.com/cloudflare/circl@1.3.7
But then "go mod tidy" failed since otel likes to refactor and rename
things for not much benefit to end users. I tried a few different things
to get it to work, but finally just updated all otel deps to the latest
version:
go get -u go.opentelemetry.io/otel/...
Test Plan: unit tests, "go mod tidy" is clean and "trivy fs go.mod"
reports no vulns.
Ran "go get google.golang.org/grpc@v1.56.3". If I just use the latest
grpc it causes OTEL to update, which has a whole cascade of updates. I'd
rather tackle that in a seperate commit since last time I did that it
caused a bunch of pain in the sourcegraph repo.
Test Plan: go test
Noticed we weren't using this yet and that the API signatures had
changed.
Test Plan: go test
This fixes an annoying bug it had for Nix users on Darwin (ie me).
Test Plan: zoekt-webserver no longer log spams on startup.
Previously, shards crashed for queries like "foo type:file" if foo was
not present.
Test plan:
updated e2e test
This is just the outcome of running gofmt with the simplify option.
Test Plan: go test
This adds the new Boost query atom to the protobuf defintion.
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
The old behavior led to panics in NextDoc if the child was pruned.
This commit introduces a new primitive Boost to our query language. It
allows boosting (or dampening) the contribution to the score a query
atoms will match contribute.
To achieve this we introduce boostMatchTree which records this weight.
We then adjust the visitMatches to take an initial score weight (1.0),
and then each time we recurse through a boostMatchTree the score weight
is multiplied by the boost weight. Additionally candidateMatch now has a
new field, scoreWeight, which records the weight at time of candidate
collection. Without boosting in the query this value will always be 1.
Finally when scoring a candidateMatch we take the final score for it and
multiply it by scoreWeight.
Note: we do not expose a way to set this in the query language, only the
query API.
Test Plan: Manual testing against webserver via the new phrase-boost URL
param. Additionally updated ranking tests to use the phrase booster.
Many a time I have seen a slow search without tracing turned on. I then
go to visit the net/trace page to see what happened but it is missing
the stats to try work out why.
This commit will ensure we always log the aggregated statistics. This
should be cheap to do given Add is relatively fast and LazyPrintf will
only do the stringer operation if someone loads the debug page. It does
mean a Stats object lives a bit longer in memory, but it is small.
Test Plan: go test
When thinking about transforming queries like 'foo bar' into '(foo bar)
or "foo bar"' we would want to keep the phrase candidateMatch and not
throw it away in gatherMatches. By sorting longer matches before others
that start at the same offset we end up keeping those.
Note: this only affects ChunkMatch, since for LineMatch we merge when we
find overlaps.
Test Plan: This was quite hard to test with our existing e2e tests due
to them not recording offsets, only matching lines. So instead I am just
relying on the fact we didn't break anything and once we add proper
support for phrases we will have a test then.
My last PR missed this. Now that we only score by candidateMatch we can
remove the matchScore implementation which was used to score LineMatch.
Test Plan: go test
This is a refactor which removes our duplicated scoring logic for
ChunkMatch vs LineMatch. Instead we score slices of []*candidateMatch.
Other than being a good refactor, candidateMatch is a much more
appropriate structure to stuff in extra information for scoring than our
public APIs. So this enables the work we want to do around atom based
scoring.
The only behaviour change in this commit are two fixes:
- DocumentSection caching would fail if empty since we relied on the
empty cache to be non-nil. This lead to inflated ContentBytesLoaded.
- Empty FileMatch would still read in DocumentSection, even though it
wasn't needed.
Test Plan: go test. Our coverage is decent, we have lots of ranking
tests which did not change and we have hardcoded stats of work done
which did not change (except for the above fixes).
There are several fields which are optional in FileMatch. For example
branch and version are only set if indexing a git repo (zoekt can index
arbitrary files). We mark these fields as optional to marshal in JSON.
The main benefit of this is improving the readability of some golden
files.
Note: I noticed this when an unrelated commit touched these golden
files.
Test Plan: go test
This example was given in our channel recently as a good result, so lets
keep track of it to ensure we don't regress.
Test Plan: go test
This has been running for three months and we haven't found a need to
disable it.
Test Plan: go test ./...
I am often reading the output of String in traces and logs, and it is
really hard to parse since there are many fields and most are unset.
This is a quality of life improvement so it is much easier to scan the
output.
For example the default zoekt-webserver struct's string output goes from
a 456 byte string to
zoekt.SearchOptions{ ShardMaxMatchCount=100000 TotalMaxMatchCount=1000000 MaxWallTime=10s }
Test Plan: go test. The unit tests ensure I cover every field now and in
the future when fields are added.
This adds four cases which are exact phrases to search for.
"assets are not configured for this binary" finds the correct document,
but it isn't shown in the summary. Our target rank doesn't capture that
this could be better, but I will use the golden file to see if I improve
this.
"sourcegraph/server docker image build" is an example of an exact phrase
which also happens to contain words of highly ranked symbols. This leads
to the exact phrase getting buried. I want to see if I can boost that.
"bufio flush writer" should find the symbol bufioFlushWriter.
"coverage data writer" should find the symbol CoverageDataWriter.
Test Plan: go test
Structs related to matches can occur a lot in memory. As such there is
some value to ensuring the order of the fields is aligned to avoid
unneccessary padding.
The "fieldalignment" tool was used to find these changes.
Test Plan: go test
This adds a new field to the golden files "targetRank" which records the
rank of the document we are looking for. Additionally the document is
marked with "**" in the golden files.
Additionally we add a new golden file which contains recall@1, recall@5
and the MRR.
I set the target documents by looking at the existing results and
guessing which was the one we wanted based on memory. In some cases we
no longer had the top document, for example for generate unit test.
Test Plan: go test
While iterating on the ranking tests it is frustrating to wait for the
index to be built. This commit adds a "-shard_cache" flag to reuse the
computed shards. It defaults to off since this could lead to unexpected
results if we change how we index.
Test Plan: run go test several times with the flag, only the first run
should be slow. Then confirm without flag it is "slow".
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.307s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.235s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.218s
$ go test
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.416s
We had ranking e2e tests living in the zoekt-archive-index cmd for
convenience since that contained useful functions for indexing a remote
tarball from the GitHub API. This commit splits the archive
functionality into a new internal/archive package and the ranking tests
into a new internal/e2e package.
The zoekt-archive-index code is now quite minimal. This is similiar to
how zoekt-git-index mostly just calls out to the gitindex package. What
is different is that archive package is marked internal, unlike
gitindex. gitindex should also be internal, but the code predates go's
support for internal.
I suspect more of our e2e tests will end up in this package.
Test Plan: go test ./...
This change uses the fact that candidate matches should be increasing in byte
offset, to avoid recounting runes on a line. Before this change if you have
many matches on the same line we would call `utf8.RuneCount` for each match,
which is a `O(nm)` algorithm where `n` is your line length and `m` is the
number of matches. After this change the complexity is `O(n)`.
I came across this while investigating slow performance for searching the
string "dev" on s2 taking 2s if the match limits where 100k instead of 10k.
With 10k it would take 0.04s. It turns out with the larger limit we ended up
searching a file were the word dev appeared many times on one line. Running a
profiler against the service came up with 96% of CPU time in `utf8.RuneCount`.
This commit adds a benchmark for the helper introduced to reuse RuneCounts.
Unsurprisingly the difference is massive between `O(nm)` and `O(n)` :)
name old time/op new time/op delta
ColumnHelper-32 299ms ± 2% 0ms ± 2% -99.97% (p=0.000 n=10+10)
Test Plan: Added tests and benchmarks.
I'd like to start taking advantage of the NumContextLines option, but have been running into an issue where, if context lines were to extend past the end of the file, we get the trailing newline at the end of the file.
This is not desirable because the empty slice after a trailing newline is not treated as a "line" by any editor, so when we split on newlines, an empty line is shown to the user. By the time we're splitting on newlines, we do not know whether or not we're at the end of the file, so we cannot know whether we can trim that trailing newline, or whether it is, correctly, an empty line that should be rendered.
This is really annoying stuff that bites us elsewhere as well (searcher, syntax highlighter). It might be nice to unify the definitions of "what is a line" in some library, but for now, I'll be happy with "don't return an extra line."
This change cleans up the Go ctags parser wrapper as a follow-up to #702. Specific changes:
* Remove synchronization in `lockedParser` and rename it to `CTagsParser`
* Push delegation to universal vs. SCIP ctags into parser wrapper
* Simplify document timeout logic
* Rename some files
We have noticed profiles where lumberjack is responsible for 1GB of
memory. This is surprising and we haven't used the log output in a long
time. Rather than debugging, we are going to remove it for now and when
we need detailed logs again we can add something back.
Test Plan: go test
We have a suspicion that when we switched to using the mmap-go library
it contributed to an issue where zoekt-webserver stops responding on
some linux versions. Our suspicion has something to do from us
hardcoding the page size to 4k to asking the kernel and how that
interacts with THP. Given we don't actually build for windows anymore,
we are partially reverting the mmap changes from
7424ac84bab3ec9ae247339909ffd84e5e3b1338
Test Plan: go test ./... on darwin. Then CI for linux testing.
This will allow us to use the new builtins.
Test Plan: go test ./... and CI
Currently, we use a single ctags process for indexing an entire repository.
Even though we build shards in parallel, they all share the same (single
threaded) ctags process. Since ctags is one of the most expensive parts of
shard building, this creates a bottleneck that can really slow down indexing.
This change proposes to launch a new ctags process per shard. For
`sgtest/megarepo`, this speeds up indexing by almost 2x (enabling scip-ctags
and setting `-parallelism=4`):
* Before: took 4 min 48 sec to index repo
* After: took 2 min 30 sec to index repo
This change allows the shard concurrency to be set through index options. This
is much more convenient than the current way to limit CPU through the server
flag `-cpu_fraction`.
In a follow-up we'll expose shard concurrency through Sourcegraph site config,
and pass it through these index options. Maybe we can deprecate and move away
from `-cpu_fraction` in favor of this approach.
This prevents local test failures when `SCIP_CTAGS_COMMAND` is set.
It seems only the comment was updated, the commit it pointed to was
incorrect. This should hopefully resolve an error in CI due to different
version of ctags between CI and local dev.
Test Plan: CI
This change makes a couple improvements to the doc content checks during indexing. When a large file is explicitly marked as "allowed", we don't enforce the max trigram count. However, we still iterated through all its trigrams and collected them in a map. Now we short-circuit the check to avoid counting all the trigrams.
We now also reuse the trigram map across documents. This makes sense, as it's always presized with the same capacity hint. This doesn't have a significant effect on indexing speed, but significantly reduces allocations
Additionally we add many queries we are tracking.
Test Plan: go test
This is an initial framework for having golden file results for search
results against a real repository. At first we have only added one query
and one repository, but it should be straightforward to grow this list
further.
The golden files we write to disk are a summary of results. This matches how
we have been using the zoekt CLI tool on the keyword branch during our ranking
work.
Test Plan: go test
By default, `zoekt-sourcegraph-indexserver` builds one repo at a time. For each
repo, shards are built in parallel using a number of threads equal to available
CPUs. There are two ways to adjust the indexing concurrency:
1. Passing `cpu_fraction`, which limits the available CPUs for parallel shard
building
2. Passing `index_concurrency` (or setting the `SRC_INDEX_CONCURRENCY`
environment variable), to index more than one repo at once
If you set `index_concurrency` to some number greater than 1, then indexing
will use more threads than available CPUs. This seems undesirable, especially
if you set `cpu_fraction`, since you'd expect that to put an upper bound on CPU
usage.
This changes the shard-level parallelism to `available CPUs / index_concurrency`
(rounded down), to bound the CPU usage as expected.
Previously, we didn't require that scip-ctags be available even if if you set
`require_ctags` and set `language_map` to something like `go:scip`. Instead, we
silently fell back to `universal-ctags`. This is tricky and could mask a real
issue where we expect scip-ctags to be available but it isn't.
Now, we check if SCIP is needed based on `language_map`, and if so require that
scip-ctags is available.
From reading the implementation recently I noticed that we should always
go via evalMatchTree to ensure we correctly use the known map for
caching. This updates the two call sites we didn't do this as well as
documenting this requirement.
I don't think this caused any noticeable performance issues. We would do
slightly more work in the case the root matchTree was an and, but it
would have been miniscule.
Test Plan: go test
The pair of bools (matches, sure) often was quite hard to reason about.
I think this came down to them often not being named at return sites,
the names itself being too concise and that two booleans represents 4
possible states, but we only had two possible states.
This commit uses a more verbose enum instead of booleans. From reading
the diff back I find the code easier to reason about, so I think this is
a good change.
Test Plan: go test ./...
I experimented with some changes to encourage `go-git` to use less memory. They
didn't pan out, but this intermediate refactor felt useful on its own. It helps
break up the super long `indexGitRepo` method.
universal-ctags sometimes returns lines that are out of bounds. In
practice it seems to only do an off by one. We haven't noticed the
linenum error until a recent change of mine which didn't append an extra
entry to NLS if the file was terminated by "\n". In practice this would
end up being filtered out later on. So we update to just continue rather
than error here.
An example is
https://github.com/sourcegraph/sourcegraph/blob/v5.2.2/client/web-sveltekit/.storybook/main.ts
$ universal-ctags '--fields=*' --output-format=json main.ts | grep 22
{"_type": "tag", "name": "config", "path": "main.ts", "pattern": "/^export default config$/", "language": "TypeScript", "line": 22, "kind": "constant", "roles": "def"}
$ wc -l main.ts
21 main.ts
$ tail -n1 main.ts
export default config
Test Plan: added a unit test
We replace all calls to Regexp.String with a vendored version which is
faster.
go1.22 introduced a commit which "minimizes" the string returned by
Regexp.String(). Part of what it does is run enumerate through literals
runes in your string to see calculate flags related to unicode and case
sensitivity. This can be quite slow, but is made worse by the fact we
call it per shard per regexp in your query.Q to construct the matchtree.
Currently Regexp.String() represents 40% of CPU time on sourcegraph.com.
Before go1.22 it was ~0%.
Note: This is a temporary change to resolve the issue. I have a deeper
change to make this less clumsy.
Note: In one place we remove the use of string by relying on
Regexp.Equal instead.
Test Plan: go test
This reverts commit 931974dd953e07ce8fbe28e3b615323c2d156e40.
After monitoring production we are seeing increased latencies in List
calls. Additionally we are seeing many more scheduler transitions into
interactive queued. The only realistic cause of this was this commit, so
we are reverting for now until further investigation.
Test Plan: go test
The previous commit which added support for selectRepoSet in List was
ineffective since the queries we get for List do not look like (and ...)
but instead directly specify the reposet atom.
This commit adds support for it. Additionally to help with our manual
testing we add support for "repo:" query in the selectRepoSet
optimization.
While pairing on this we decided to improve the debug output to tracing
and rename the misnamed query.Q variable in List from r to q.
Test Plan: updated the test case to directly test the sort of queries we
get in List calls. Additionally ran "zoekt-webserver -pprof" and
observed in net/trace output that selectRepoSet optimization reduced the
number of searched shards.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously List would never call proc.Yield, which broke the
co-operative scheduler in the case of slow List calls. Additionally, we
used a naive concurrency (a large buffered channel) which shows up in
the profiler as 30% of CPU spent on chan_ related operations under List.
This commit follows how Search used to respect proc.Yield. See sched.go
in 90ed7bfd07e0a3137e3ee627cbf6824446df9c4d. We did not copy Search
since it uses a more complicated implementation than we need since it
supports streaming, while List is still batch only.
We needed to use errgroup to ensure we drained all channels in the case
of an error. Previously we did not need to do this since the channels
had a buffer size of len(shards), which gaurenteed nothing would ever
block. Now channels are never larger than the number of workers (<=
GOMAXPROCS).
Test Plan: go test covers the no error cases. In the case of errors we
manually tested by running zoekt-webserver and adding a random context
cancellation. We observed the error being reported and no List
goroutines running.
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
In go 1.22 the way go converts a query like .* into a String changed.
This meant that when we compiled zoekt in the sourcegraph codebase our
"regex matches all" optimization stopped working.
This commit moves us to a simple inspection of the syntax tree which
will be stable across versions and to be honest just feels better.
To do this we introduce a field "origRegexp" to the regexpMatchTree so
we can avoid needing to recompile the syntax tree. Note: Regexp does not
offer a way to get at its internal syntax.Regexp which would be nice.
Test Plan: go test and added a unit test
We no longer serve these old HTTP APIs starting with Sourcegraph 5.3, so this code is no longer required.
I hope this makes your lives a little easier by not having to maintain two code paths.
If this change is annoying (because you cannot backport fixes to older versions as easily), feel free to either keep it on hold or close it.
## Test plan
CI still passes, anything else I should be trying?
Currently, we truncate a file's contents to 2048 bytes before passing it to
`go-enry`. I ran into a few cases where this is causing us to misclassify
files.
This PR removes the truncation. It should still be fine in terms of
performance, since `go-enry` is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.
This updates two of our dependencies to resolve GHSA-9763-4f94-gfch and
CVE-2023-47108.
Initially I ran
go get \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@0.46.0 \
github.com/cloudflare/circl@1.3.7
But then "go mod tidy" failed since otel likes to refactor and rename
things for not much benefit to end users. I tried a few different things
to get it to work, but finally just updated all otel deps to the latest
version:
go get -u go.opentelemetry.io/otel/...
Test Plan: unit tests, "go mod tidy" is clean and "trivy fs go.mod"
reports no vulns.
This commit introduces a new primitive Boost to our query language. It
allows boosting (or dampening) the contribution to the score a query
atoms will match contribute.
To achieve this we introduce boostMatchTree which records this weight.
We then adjust the visitMatches to take an initial score weight (1.0),
and then each time we recurse through a boostMatchTree the score weight
is multiplied by the boost weight. Additionally candidateMatch now has a
new field, scoreWeight, which records the weight at time of candidate
collection. Without boosting in the query this value will always be 1.
Finally when scoring a candidateMatch we take the final score for it and
multiply it by scoreWeight.
Note: we do not expose a way to set this in the query language, only the
query API.
Test Plan: Manual testing against webserver via the new phrase-boost URL
param. Additionally updated ranking tests to use the phrase booster.
Many a time I have seen a slow search without tracing turned on. I then
go to visit the net/trace page to see what happened but it is missing
the stats to try work out why.
This commit will ensure we always log the aggregated statistics. This
should be cheap to do given Add is relatively fast and LazyPrintf will
only do the stringer operation if someone loads the debug page. It does
mean a Stats object lives a bit longer in memory, but it is small.
Test Plan: go test
When thinking about transforming queries like 'foo bar' into '(foo bar)
or "foo bar"' we would want to keep the phrase candidateMatch and not
throw it away in gatherMatches. By sorting longer matches before others
that start at the same offset we end up keeping those.
Note: this only affects ChunkMatch, since for LineMatch we merge when we
find overlaps.
Test Plan: This was quite hard to test with our existing e2e tests due
to them not recording offsets, only matching lines. So instead I am just
relying on the fact we didn't break anything and once we add proper
support for phrases we will have a test then.
This is a refactor which removes our duplicated scoring logic for
ChunkMatch vs LineMatch. Instead we score slices of []*candidateMatch.
Other than being a good refactor, candidateMatch is a much more
appropriate structure to stuff in extra information for scoring than our
public APIs. So this enables the work we want to do around atom based
scoring.
The only behaviour change in this commit are two fixes:
- DocumentSection caching would fail if empty since we relied on the
empty cache to be non-nil. This lead to inflated ContentBytesLoaded.
- Empty FileMatch would still read in DocumentSection, even though it
wasn't needed.
Test Plan: go test. Our coverage is decent, we have lots of ranking
tests which did not change and we have hardcoded stats of work done
which did not change (except for the above fixes).
There are several fields which are optional in FileMatch. For example
branch and version are only set if indexing a git repo (zoekt can index
arbitrary files). We mark these fields as optional to marshal in JSON.
The main benefit of this is improving the readability of some golden
files.
Note: I noticed this when an unrelated commit touched these golden
files.
Test Plan: go test
I am often reading the output of String in traces and logs, and it is
really hard to parse since there are many fields and most are unset.
This is a quality of life improvement so it is much easier to scan the
output.
For example the default zoekt-webserver struct's string output goes from
a 456 byte string to
zoekt.SearchOptions{ ShardMaxMatchCount=100000 TotalMaxMatchCount=1000000 MaxWallTime=10s }
Test Plan: go test. The unit tests ensure I cover every field now and in
the future when fields are added.
This adds four cases which are exact phrases to search for.
"assets are not configured for this binary" finds the correct document,
but it isn't shown in the summary. Our target rank doesn't capture that
this could be better, but I will use the golden file to see if I improve
this.
"sourcegraph/server docker image build" is an example of an exact phrase
which also happens to contain words of highly ranked symbols. This leads
to the exact phrase getting buried. I want to see if I can boost that.
"bufio flush writer" should find the symbol bufioFlushWriter.
"coverage data writer" should find the symbol CoverageDataWriter.
Test Plan: go test
This adds a new field to the golden files "targetRank" which records the
rank of the document we are looking for. Additionally the document is
marked with "**" in the golden files.
Additionally we add a new golden file which contains recall@1, recall@5
and the MRR.
I set the target documents by looking at the existing results and
guessing which was the one we wanted based on memory. In some cases we
no longer had the top document, for example for generate unit test.
Test Plan: go test
While iterating on the ranking tests it is frustrating to wait for the
index to be built. This commit adds a "-shard_cache" flag to reuse the
computed shards. It defaults to off since this could lead to unexpected
results if we change how we index.
Test Plan: run go test several times with the flag, only the first run
should be slow. Then confirm without flag it is "slow".
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.307s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.235s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.218s
$ go test
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.416s
We had ranking e2e tests living in the zoekt-archive-index cmd for
convenience since that contained useful functions for indexing a remote
tarball from the GitHub API. This commit splits the archive
functionality into a new internal/archive package and the ranking tests
into a new internal/e2e package.
The zoekt-archive-index code is now quite minimal. This is similiar to
how zoekt-git-index mostly just calls out to the gitindex package. What
is different is that archive package is marked internal, unlike
gitindex. gitindex should also be internal, but the code predates go's
support for internal.
I suspect more of our e2e tests will end up in this package.
Test Plan: go test ./...
This change uses the fact that candidate matches should be increasing in byte
offset, to avoid recounting runes on a line. Before this change if you have
many matches on the same line we would call `utf8.RuneCount` for each match,
which is a `O(nm)` algorithm where `n` is your line length and `m` is the
number of matches. After this change the complexity is `O(n)`.
I came across this while investigating slow performance for searching the
string "dev" on s2 taking 2s if the match limits where 100k instead of 10k.
With 10k it would take 0.04s. It turns out with the larger limit we ended up
searching a file were the word dev appeared many times on one line. Running a
profiler against the service came up with 96% of CPU time in `utf8.RuneCount`.
This commit adds a benchmark for the helper introduced to reuse RuneCounts.
Unsurprisingly the difference is massive between `O(nm)` and `O(n)` :)
name old time/op new time/op delta
ColumnHelper-32 299ms ± 2% 0ms ± 2% -99.97% (p=0.000 n=10+10)
Test Plan: Added tests and benchmarks.
I'd like to start taking advantage of the NumContextLines option, but have been running into an issue where, if context lines were to extend past the end of the file, we get the trailing newline at the end of the file.
This is not desirable because the empty slice after a trailing newline is not treated as a "line" by any editor, so when we split on newlines, an empty line is shown to the user. By the time we're splitting on newlines, we do not know whether or not we're at the end of the file, so we cannot know whether we can trim that trailing newline, or whether it is, correctly, an empty line that should be rendered.
This is really annoying stuff that bites us elsewhere as well (searcher, syntax highlighter). It might be nice to unify the definitions of "what is a line" in some library, but for now, I'll be happy with "don't return an extra line."
We have a suspicion that when we switched to using the mmap-go library
it contributed to an issue where zoekt-webserver stops responding on
some linux versions. Our suspicion has something to do from us
hardcoding the page size to 4k to asking the kernel and how that
interacts with THP. Given we don't actually build for windows anymore,
we are partially reverting the mmap changes from
7424ac84bab3ec9ae247339909ffd84e5e3b1338
Test Plan: go test ./... on darwin. Then CI for linux testing.
Currently, we use a single ctags process for indexing an entire repository.
Even though we build shards in parallel, they all share the same (single
threaded) ctags process. Since ctags is one of the most expensive parts of
shard building, this creates a bottleneck that can really slow down indexing.
This change proposes to launch a new ctags process per shard. For
`sgtest/megarepo`, this speeds up indexing by almost 2x (enabling scip-ctags
and setting `-parallelism=4`):
* Before: took 4 min 48 sec to index repo
* After: took 2 min 30 sec to index repo
This change allows the shard concurrency to be set through index options. This
is much more convenient than the current way to limit CPU through the server
flag `-cpu_fraction`.
In a follow-up we'll expose shard concurrency through Sourcegraph site config,
and pass it through these index options. Maybe we can deprecate and move away
from `-cpu_fraction` in favor of this approach.
This change makes a couple improvements to the doc content checks during indexing. When a large file is explicitly marked as "allowed", we don't enforce the max trigram count. However, we still iterated through all its trigrams and collected them in a map. Now we short-circuit the check to avoid counting all the trigrams.
We now also reuse the trigram map across documents. This makes sense, as it's always presized with the same capacity hint. This doesn't have a significant effect on indexing speed, but significantly reduces allocations
This is an initial framework for having golden file results for search
results against a real repository. At first we have only added one query
and one repository, but it should be straightforward to grow this list
further.
The golden files we write to disk are a summary of results. This matches how
we have been using the zoekt CLI tool on the keyword branch during our ranking
work.
Test Plan: go test
By default, `zoekt-sourcegraph-indexserver` builds one repo at a time. For each
repo, shards are built in parallel using a number of threads equal to available
CPUs. There are two ways to adjust the indexing concurrency:
1. Passing `cpu_fraction`, which limits the available CPUs for parallel shard
building
2. Passing `index_concurrency` (or setting the `SRC_INDEX_CONCURRENCY`
environment variable), to index more than one repo at once
If you set `index_concurrency` to some number greater than 1, then indexing
will use more threads than available CPUs. This seems undesirable, especially
if you set `cpu_fraction`, since you'd expect that to put an upper bound on CPU
usage.
This changes the shard-level parallelism to `available CPUs / index_concurrency`
(rounded down), to bound the CPU usage as expected.
Previously, we didn't require that scip-ctags be available even if if you set
`require_ctags` and set `language_map` to something like `go:scip`. Instead, we
silently fell back to `universal-ctags`. This is tricky and could mask a real
issue where we expect scip-ctags to be available but it isn't.
Now, we check if SCIP is needed based on `language_map`, and if so require that
scip-ctags is available.
From reading the implementation recently I noticed that we should always
go via evalMatchTree to ensure we correctly use the known map for
caching. This updates the two call sites we didn't do this as well as
documenting this requirement.
I don't think this caused any noticeable performance issues. We would do
slightly more work in the case the root matchTree was an and, but it
would have been miniscule.
Test Plan: go test
The pair of bools (matches, sure) often was quite hard to reason about.
I think this came down to them often not being named at return sites,
the names itself being too concise and that two booleans represents 4
possible states, but we only had two possible states.
This commit uses a more verbose enum instead of booleans. From reading
the diff back I find the code easier to reason about, so I think this is
a good change.
Test Plan: go test ./...
universal-ctags sometimes returns lines that are out of bounds. In
practice it seems to only do an off by one. We haven't noticed the
linenum error until a recent change of mine which didn't append an extra
entry to NLS if the file was terminated by "\n". In practice this would
end up being filtered out later on. So we update to just continue rather
than error here.
An example is
https://github.com/sourcegraph/sourcegraph/blob/v5.2.2/client/web-sveltekit/.storybook/main.ts
$ universal-ctags '--fields=*' --output-format=json main.ts | grep 22
{"_type": "tag", "name": "config", "path": "main.ts", "pattern": "/^export default config$/", "language": "TypeScript", "line": 22, "kind": "constant", "roles": "def"}
$ wc -l main.ts
21 main.ts
$ tail -n1 main.ts
export default config
Test Plan: added a unit test