Commits
gofumpt is a stricter gofmt. I took a look at the changes and in general
they are nice. I don't think we need to enforce the use of gofumpt, but
I like the idea of running it every once in a while.
Test Plan: go test ./...
This updates two of our dependencies to resolve GHSA-9763-4f94-gfch and
CVE-2023-47108.
Initially I ran
go get \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@0.46.0 \
github.com/cloudflare/circl@1.3.7
But then "go mod tidy" failed since otel likes to refactor and rename
things for not much benefit to end users. I tried a few different things
to get it to work, but finally just updated all otel deps to the latest
version:
go get -u go.opentelemetry.io/otel/...
Test Plan: unit tests, "go mod tidy" is clean and "trivy fs go.mod"
reports no vulns.
Ran "go get google.golang.org/grpc@v1.56.3". If I just use the latest
grpc it causes OTEL to update, which has a whole cascade of updates. I'd
rather tackle that in a seperate commit since last time I did that it
caused a bunch of pain in the sourcegraph repo.
Test Plan: go test
Noticed we weren't using this yet and that the API signatures had
changed.
Test Plan: go test
This fixes an annoying bug it had for Nix users on Darwin (ie me).
Test Plan: zoekt-webserver no longer log spams on startup.
Previously, shards crashed for queries like "foo type:file" if foo was
not present.
Test plan:
updated e2e test
This is just the outcome of running gofmt with the simplify option.
Test Plan: go test
This adds the new Boost query atom to the protobuf defintion.
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
The old behavior led to panics in NextDoc if the child was pruned.
This commit introduces a new primitive Boost to our query language. It
allows boosting (or dampening) the contribution to the score a query
atoms will match contribute.
To achieve this we introduce boostMatchTree which records this weight.
We then adjust the visitMatches to take an initial score weight (1.0),
and then each time we recurse through a boostMatchTree the score weight
is multiplied by the boost weight. Additionally candidateMatch now has a
new field, scoreWeight, which records the weight at time of candidate
collection. Without boosting in the query this value will always be 1.
Finally when scoring a candidateMatch we take the final score for it and
multiply it by scoreWeight.
Note: we do not expose a way to set this in the query language, only the
query API.
Test Plan: Manual testing against webserver via the new phrase-boost URL
param. Additionally updated ranking tests to use the phrase booster.
Many a time I have seen a slow search without tracing turned on. I then
go to visit the net/trace page to see what happened but it is missing
the stats to try work out why.
This commit will ensure we always log the aggregated statistics. This
should be cheap to do given Add is relatively fast and LazyPrintf will
only do the stringer operation if someone loads the debug page. It does
mean a Stats object lives a bit longer in memory, but it is small.
Test Plan: go test
When thinking about transforming queries like 'foo bar' into '(foo bar)
or "foo bar"' we would want to keep the phrase candidateMatch and not
throw it away in gatherMatches. By sorting longer matches before others
that start at the same offset we end up keeping those.
Note: this only affects ChunkMatch, since for LineMatch we merge when we
find overlaps.
Test Plan: This was quite hard to test with our existing e2e tests due
to them not recording offsets, only matching lines. So instead I am just
relying on the fact we didn't break anything and once we add proper
support for phrases we will have a test then.
My last PR missed this. Now that we only score by candidateMatch we can
remove the matchScore implementation which was used to score LineMatch.
Test Plan: go test
This is a refactor which removes our duplicated scoring logic for
ChunkMatch vs LineMatch. Instead we score slices of []*candidateMatch.
Other than being a good refactor, candidateMatch is a much more
appropriate structure to stuff in extra information for scoring than our
public APIs. So this enables the work we want to do around atom based
scoring.
The only behaviour change in this commit are two fixes:
- DocumentSection caching would fail if empty since we relied on the
empty cache to be non-nil. This lead to inflated ContentBytesLoaded.
- Empty FileMatch would still read in DocumentSection, even though it
wasn't needed.
Test Plan: go test. Our coverage is decent, we have lots of ranking
tests which did not change and we have hardcoded stats of work done
which did not change (except for the above fixes).
There are several fields which are optional in FileMatch. For example
branch and version are only set if indexing a git repo (zoekt can index
arbitrary files). We mark these fields as optional to marshal in JSON.
The main benefit of this is improving the readability of some golden
files.
Note: I noticed this when an unrelated commit touched these golden
files.
Test Plan: go test
This example was given in our channel recently as a good result, so lets
keep track of it to ensure we don't regress.
Test Plan: go test
This has been running for three months and we haven't found a need to
disable it.
Test Plan: go test ./...
I am often reading the output of String in traces and logs, and it is
really hard to parse since there are many fields and most are unset.
This is a quality of life improvement so it is much easier to scan the
output.
For example the default zoekt-webserver struct's string output goes from
a 456 byte string to
zoekt.SearchOptions{ ShardMaxMatchCount=100000 TotalMaxMatchCount=1000000 MaxWallTime=10s }
Test Plan: go test. The unit tests ensure I cover every field now and in
the future when fields are added.
This adds four cases which are exact phrases to search for.
"assets are not configured for this binary" finds the correct document,
but it isn't shown in the summary. Our target rank doesn't capture that
this could be better, but I will use the golden file to see if I improve
this.
"sourcegraph/server docker image build" is an example of an exact phrase
which also happens to contain words of highly ranked symbols. This leads
to the exact phrase getting buried. I want to see if I can boost that.
"bufio flush writer" should find the symbol bufioFlushWriter.
"coverage data writer" should find the symbol CoverageDataWriter.
Test Plan: go test
Structs related to matches can occur a lot in memory. As such there is
some value to ensuring the order of the fields is aligned to avoid
unneccessary padding.
The "fieldalignment" tool was used to find these changes.
Test Plan: go test
This adds a new field to the golden files "targetRank" which records the
rank of the document we are looking for. Additionally the document is
marked with "**" in the golden files.
Additionally we add a new golden file which contains recall@1, recall@5
and the MRR.
I set the target documents by looking at the existing results and
guessing which was the one we wanted based on memory. In some cases we
no longer had the top document, for example for generate unit test.
Test Plan: go test
While iterating on the ranking tests it is frustrating to wait for the
index to be built. This commit adds a "-shard_cache" flag to reuse the
computed shards. It defaults to off since this could lead to unexpected
results if we change how we index.
Test Plan: run go test several times with the flag, only the first run
should be slow. Then confirm without flag it is "slow".
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.307s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.235s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.218s
$ go test
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.416s
We had ranking e2e tests living in the zoekt-archive-index cmd for
convenience since that contained useful functions for indexing a remote
tarball from the GitHub API. This commit splits the archive
functionality into a new internal/archive package and the ranking tests
into a new internal/e2e package.
The zoekt-archive-index code is now quite minimal. This is similiar to
how zoekt-git-index mostly just calls out to the gitindex package. What
is different is that archive package is marked internal, unlike
gitindex. gitindex should also be internal, but the code predates go's
support for internal.
I suspect more of our e2e tests will end up in this package.
Test Plan: go test ./...
This change uses the fact that candidate matches should be increasing in byte
offset, to avoid recounting runes on a line. Before this change if you have
many matches on the same line we would call `utf8.RuneCount` for each match,
which is a `O(nm)` algorithm where `n` is your line length and `m` is the
number of matches. After this change the complexity is `O(n)`.
I came across this while investigating slow performance for searching the
string "dev" on s2 taking 2s if the match limits where 100k instead of 10k.
With 10k it would take 0.04s. It turns out with the larger limit we ended up
searching a file were the word dev appeared many times on one line. Running a
profiler against the service came up with 96% of CPU time in `utf8.RuneCount`.
This commit adds a benchmark for the helper introduced to reuse RuneCounts.
Unsurprisingly the difference is massive between `O(nm)` and `O(n)` :)
name old time/op new time/op delta
ColumnHelper-32 299ms ± 2% 0ms ± 2% -99.97% (p=0.000 n=10+10)
Test Plan: Added tests and benchmarks.
I'd like to start taking advantage of the NumContextLines option, but have been running into an issue where, if context lines were to extend past the end of the file, we get the trailing newline at the end of the file.
This is not desirable because the empty slice after a trailing newline is not treated as a "line" by any editor, so when we split on newlines, an empty line is shown to the user. By the time we're splitting on newlines, we do not know whether or not we're at the end of the file, so we cannot know whether we can trim that trailing newline, or whether it is, correctly, an empty line that should be rendered.
This is really annoying stuff that bites us elsewhere as well (searcher, syntax highlighter). It might be nice to unify the definitions of "what is a line" in some library, but for now, I'll be happy with "don't return an extra line."
This change cleans up the Go ctags parser wrapper as a follow-up to #702. Specific changes:
* Remove synchronization in `lockedParser` and rename it to `CTagsParser`
* Push delegation to universal vs. SCIP ctags into parser wrapper
* Simplify document timeout logic
* Rename some files
We have noticed profiles where lumberjack is responsible for 1GB of
memory. This is surprising and we haven't used the log output in a long
time. Rather than debugging, we are going to remove it for now and when
we need detailed logs again we can add something back.
Test Plan: go test
We have a suspicion that when we switched to using the mmap-go library
it contributed to an issue where zoekt-webserver stops responding on
some linux versions. Our suspicion has something to do from us
hardcoding the page size to 4k to asking the kernel and how that
interacts with THP. Given we don't actually build for windows anymore,
we are partially reverting the mmap changes from
7424ac84bab3ec9ae247339909ffd84e5e3b1338
Test Plan: go test ./... on darwin. Then CI for linux testing.
This will allow us to use the new builtins.
Test Plan: go test ./... and CI
Currently, we use a single ctags process for indexing an entire repository.
Even though we build shards in parallel, they all share the same (single
threaded) ctags process. Since ctags is one of the most expensive parts of
shard building, this creates a bottleneck that can really slow down indexing.
This change proposes to launch a new ctags process per shard. For
`sgtest/megarepo`, this speeds up indexing by almost 2x (enabling scip-ctags
and setting `-parallelism=4`):
* Before: took 4 min 48 sec to index repo
* After: took 2 min 30 sec to index repo
This change allows the shard concurrency to be set through index options. This
is much more convenient than the current way to limit CPU through the server
flag `-cpu_fraction`.
In a follow-up we'll expose shard concurrency through Sourcegraph site config,
and pass it through these index options. Maybe we can deprecate and move away
from `-cpu_fraction` in favor of this approach.
This prevents local test failures when `SCIP_CTAGS_COMMAND` is set.
It seems only the comment was updated, the commit it pointed to was
incorrect. This should hopefully resolve an error in CI due to different
version of ctags between CI and local dev.
Test Plan: CI
This change makes a couple improvements to the doc content checks during indexing. When a large file is explicitly marked as "allowed", we don't enforce the max trigram count. However, we still iterated through all its trigrams and collected them in a map. Now we short-circuit the check to avoid counting all the trigrams.
We now also reuse the trigram map across documents. This makes sense, as it's always presized with the same capacity hint. This doesn't have a significant effect on indexing speed, but significantly reduces allocations
Additionally we add many queries we are tracking.
Test Plan: go test
This is an initial framework for having golden file results for search
results against a real repository. At first we have only added one query
and one repository, but it should be straightforward to grow this list
further.
The golden files we write to disk are a summary of results. This matches how
we have been using the zoekt CLI tool on the keyword branch during our ranking
work.
Test Plan: go test
By default, `zoekt-sourcegraph-indexserver` builds one repo at a time. For each
repo, shards are built in parallel using a number of threads equal to available
CPUs. There are two ways to adjust the indexing concurrency:
1. Passing `cpu_fraction`, which limits the available CPUs for parallel shard
building
2. Passing `index_concurrency` (or setting the `SRC_INDEX_CONCURRENCY`
environment variable), to index more than one repo at once
If you set `index_concurrency` to some number greater than 1, then indexing
will use more threads than available CPUs. This seems undesirable, especially
if you set `cpu_fraction`, since you'd expect that to put an upper bound on CPU
usage.
This changes the shard-level parallelism to `available CPUs / index_concurrency`
(rounded down), to bound the CPU usage as expected.
Previously, we didn't require that scip-ctags be available even if if you set
`require_ctags` and set `language_map` to something like `go:scip`. Instead, we
silently fell back to `universal-ctags`. This is tricky and could mask a real
issue where we expect scip-ctags to be available but it isn't.
Now, we check if SCIP is needed based on `language_map`, and if so require that
scip-ctags is available.
From reading the implementation recently I noticed that we should always
go via evalMatchTree to ensure we correctly use the known map for
caching. This updates the two call sites we didn't do this as well as
documenting this requirement.
I don't think this caused any noticeable performance issues. We would do
slightly more work in the case the root matchTree was an and, but it
would have been miniscule.
Test Plan: go test
The pair of bools (matches, sure) often was quite hard to reason about.
I think this came down to them often not being named at return sites,
the names itself being too concise and that two booleans represents 4
possible states, but we only had two possible states.
This commit uses a more verbose enum instead of booleans. From reading
the diff back I find the code easier to reason about, so I think this is
a good change.
Test Plan: go test ./...
I experimented with some changes to encourage `go-git` to use less memory. They
didn't pan out, but this intermediate refactor felt useful on its own. It helps
break up the super long `indexGitRepo` method.
universal-ctags sometimes returns lines that are out of bounds. In
practice it seems to only do an off by one. We haven't noticed the
linenum error until a recent change of mine which didn't append an extra
entry to NLS if the file was terminated by "\n". In practice this would
end up being filtered out later on. So we update to just continue rather
than error here.
An example is
https://github.com/sourcegraph/sourcegraph/blob/v5.2.2/client/web-sveltekit/.storybook/main.ts
$ universal-ctags '--fields=*' --output-format=json main.ts | grep 22
{"_type": "tag", "name": "config", "path": "main.ts", "pattern": "/^export default config$/", "language": "TypeScript", "line": 22, "kind": "constant", "roles": "def"}
$ wc -l main.ts
21 main.ts
$ tail -n1 main.ts
export default config
Test Plan: added a unit test
This PR makes the httpClient an interface with the minimal surface area required by the stream.Client.
This lets us inject a custom doer more easily, which can be beneficial for testing, and in the case of Sourcegraph, lets us augment the Doer with middlewares.
When indexing, we build shards in parallel based on the `parallelism` flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly `100MB * parallelism`.
Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.
This change fixes the throttling logic so we block indexing when all threads
are busy building shards.
This change makes a couple small improvements to how we handle skipped docs:
* Immediately skip ctags parsing if the content is `nil`
* Always sort skipped docs to the end of the shard. This seems like a nice
invariant. And generally it's good for performance to group data that is
expected to be accessed together and has similar content.
When indexing documents, we buffer up documents until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffered the full document. So if there are a large
number of binary files, we could easily blow past the 100MB limit and run into
memory issues.
This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
This was a huge oversight that has lived in our codebase since we
introduced symbolRegexpMatchTree. Because we don't call prepare, we
don't correctly use the index for symbol regex queries. From some local
testing this makes a huge difference to performance.
Huge shout-out to @camdencheek who spotted this.
Test Plan: validated with some local searches that results remain the same and
that the statistics for the searches go up for IndexBytesLoaded, but go down
for ContentBytesLoaded, FilesConsidered, FilesLoaded, etc. Added unit tests
which assert the index is used. Also perf tested with hyperfine.
Hyperfine results:
Benchmark 1: ./zoekt-before -sym '^searcher$'
Time (mean ± σ): 93.0 ms ± 1.2 ms [User: 142.2 ms, System: 18.9 ms]
Range (min … max): 90.8 ms … 95.6 ms 31 runs
Benchmark 2: ./zoekt-after -sym '^searcher$'
Time (mean ± σ): 52.3 ms ± 0.5 ms [User: 76.3 ms, System: 13.0 ms]
Range (min … max): 50.7 ms … 53.4 ms 53 runs
Summary
'./zoekt-after -sym '^searcher$'' ran
1.78 ± 0.03 times faster than './zoekt-before -sym '^searcher$''
For that search, a random comparison of the zoekt stats:
| Stat | Before | After | Delta |
|---------------------- |---------- |--------- |----------- |
| ContentBytesLoaded | 199007382 | 22566033 | -176441349 |
| IndexBytesLoaded | 3527 | 165645 | 162118 |
| Crashes | 0 | 0 | 0 |
| Duration | 57956167 | 17568708 | -40387459 |
| FileCount | 28 | 28 | 0 |
| ShardFilesConsidered | 0 | 0 | 0 |
| FilesConsidered | 28477 | 766 | -27711 |
| FilesLoaded | 28477 | 766 | -27711 |
| FilesSkipped | 0 | 0 | 0 |
| ShardsScanned | 5 | 5 | 0 |
| ShardsSkipped | 0 | 0 | 0 |
| ShardsSkippedFilter | 0 | 0 | 0 |
| MatchCount | 29 | 29 | 0 |
| NgramMatches | 87 | 4407 | 4320 |
| NgramLookups | 644 | 644 | 0 |
| Wait | 5792 | 11500 | 5708 |
| MatchTreeConstruction | 498042 | 515248 | 17206 |
| MatchTreeSearch | 97661875 | 23089418 | -74572457 |
Analysis: An absolutely massive reduction in the number of files we consider.
This means we are actually using the index properly. eg look at
ContentBytesLoaded, Duration, FilesConsidered, FilesLoaded. You can also see
that IndexBytesLoaded has gone up since we now use it properly. This was on a
small corpus so will have huge impact in production.
Note that the random changes Wait, MatchTreeConstruction are random, but the
MatchTreeSearch change is a big deal since that is time spent searching after
analysing a query.
Firstly we use bytes.IndexByte for faster newLinesIndices. On my machine this
reduces wall clock time of BenchmarkTagsToSections by 38%. This is faster
since bytes.IndexByte relies on CPU specific optimizations to find the next
new line (eg uses AVX2 if available).
Secondly we reuse nls slice between calls to tagsToSections. I noticed in the
profiler a nonsignificant chunk in the garbage collector. The slice built by
newLinesIndices is allocated and thrown away for each call to tagsToSections.
This means we can re-use it which this commit implements by introducing a
struct storing the buffer. We now use this buffer per shard of symbols we
analyse.
old time/op new time/op delta
188µs ± 7% 101µs ± 3% -46.10% (p=0.000 n=10+10)
old alloc/op new alloc/op delta
79.3kB ± 0% 36.3kB ± 0% -54.24% (p=0.000 n=9+10)
old allocs/op new allocs/op delta
443 ± 0% 441 ± 0% -0.45% (p=0.000 n=10+10)
Test Plan: go test -bench BenchmarkTagsToSections
I found the code a bit hard to read before, this I believe is more
clear. I was also hoping for an improvement in the benchmarks, but the
improvement was statistically insignificant.
Test Plan: go test
This change adds a benchmark for the conversion from ctags output to Zoekt
document data, plus a tiny optimization to presize the symbol slices.
This adds a monitor which will report every minute the progress of
symbol analysis. Additionally, if a document is taking too long to
analyse (10s) we report it.
At first this is just reporting via stdlog. However, once we are
comfortable with thresholds around this we can likely also include a way
to kill analysis for a file.
Test Plan: Adjusted monitorReportStatus to 1s then indexed the
sourcegraph repo and inspected the output
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:03:10 attempting to index 14533 total files
2023/11/03 16:03:13 DEBUG: symbol analysis still running for shard statistics: duration=1s symbols=15805 bytes=44288971
2023/11/03 16:03:14 DEBUG: symbol analysis still running for shard statistics: duration=2s symbols=26189 bytes=51564417
2023/11/03 16:03:15 DEBUG: symbol analysis still running for shard statistics: duration=3s symbols=55613 bytes=64748084
2023/11/03 16:03:16 DEBUG: symbol analysis still running for shard statistics: duration=4s symbols=86557 bytes=93771404
2023/11/03 16:03:17 DEBUG: symbol analysis still running for shard statistics: duration=5s symbols=125352 bytes=116319453
2023/11/03 16:03:18 symbol analysis finished for shard statistics: duration=5s symbols=142951 bytes=129180023
2023/11/03 16:03:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983298 index bytes (overhead 2.8), 14533 files processed
I then added a random sleep for a minute in a file to see the stuck
reporting:
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:14:57 attempting to index 14533 total files
2023/11/03 16:15:15 WARN: symbol analysis for README.md (3485 bytes) has been running for 14s
2023/11/03 16:15:25 WARN: symbol analysis for README.md (3485 bytes) has been running for 24s
2023/11/03 16:15:45 WARN: symbol analysis for README.md (3485 bytes) has been running for 44s
2023/11/03 16:16:00 DEBUG: symbol analysis still running for shard statistics: duration=1m0s symbols=958 bytes=624329
2023/11/03 16:16:00 symbol analysis for README.md (size 3485 bytes) is done and found 4 symbols
2023/11/03 16:16:06 symbol analysis finished for shard statistics: duration=1m5s symbols=142951 bytes=129180023
2023/11/03 16:16:10 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983299 index bytes (overhead 2.8), 14533 files processed
This change refactors our end-to-end scoring tests and enables local testing
using the scip-ctags binary:
* Split scoring tests out of `e2e_test` and into their own file `scoring_test`
* Split huge test methods into targeted ones like `TestFileNameMatch`,
`TestJava`, `TestGo`, etc.
* For languages that scip-ctags supports, rerun the same cases using the
scip-ctags binary
To run scip-ctags tests locally, you can set the env variable
```
SCIP_CTAGS_COMMAND=<sourcegraph-repo>/dev/scip-ctags-dev
```
This doesn't yet update Zoekt CI to run scip-ctags tests. That will be tackled
in a follow-up.
On large repos, indexing might take quite a while and hit the indexing timeout.
This change helps debug these situations:
* Make the indexing timeout configurable through an env variable
`INDEXING_TIMEOUT`
* Add more info to progress logging: log the total number of files being
indexed, plus the file count per shard
SCIP ctags can output different kind names than universal-ctags (for example
`typeAlias` instead of `talias`). This change makes sure we handle different
names for the same kind. To do so, it refactors the logic so we first match
strings to standard kinds, then decide how these are scored for each language.
That way, you don't need to remember to cover all the possible kind names each
time you adjust scoring for a new language.
Also added basic tests for Ruby and Python to ensure we don't accidentally
change the scoring.
Right now our symbol analyser doesn't tell us if a symbol is exported. We add
a go specific tweak here to boost those results. Ideally this could be
something that is encoded in the symbol information.
Additionally we do downrank _test.go files via the doc-order. But in the case
of symbol matches the boosting overweighs doc order signficantly. I found the
extra downraking quite useful when experimenting.
Test Plan: lots of manual testing on the keyword branch
Right now we boost a file extension that hasn't been seen to the 3rd
position. This is gated by an environment variable which defaults to
on. I want to explore if there are ways we can turn on this behaviour
with the query language.
Test Plan: go run ./cmd/zoekt foo
Tiny change to use a type def, as we usually prefer those over type aliases.
We bump go-ctags to include markdown files in symbol analysis. We
already have ranking behaviour for markdown files but due to go-ctags it
won't be taken into account (until this commit).
Additionally we bump the version included in the docker container to
v6.0.0.
Test Plan: CI and "docker build -t zoekt ."
We add a bit more complexity to addScore but avoid allocating a string if debugScore = false.
Test plan:
I ran a couple of benchmarks and confirmed we don't allocate.
This updates two of our dependencies to resolve GHSA-9763-4f94-gfch and
CVE-2023-47108.
Initially I ran
go get \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@0.46.0 \
github.com/cloudflare/circl@1.3.7
But then "go mod tidy" failed since otel likes to refactor and rename
things for not much benefit to end users. I tried a few different things
to get it to work, but finally just updated all otel deps to the latest
version:
go get -u go.opentelemetry.io/otel/...
Test Plan: unit tests, "go mod tidy" is clean and "trivy fs go.mod"
reports no vulns.
This commit introduces a new primitive Boost to our query language. It
allows boosting (or dampening) the contribution to the score a query
atoms will match contribute.
To achieve this we introduce boostMatchTree which records this weight.
We then adjust the visitMatches to take an initial score weight (1.0),
and then each time we recurse through a boostMatchTree the score weight
is multiplied by the boost weight. Additionally candidateMatch now has a
new field, scoreWeight, which records the weight at time of candidate
collection. Without boosting in the query this value will always be 1.
Finally when scoring a candidateMatch we take the final score for it and
multiply it by scoreWeight.
Note: we do not expose a way to set this in the query language, only the
query API.
Test Plan: Manual testing against webserver via the new phrase-boost URL
param. Additionally updated ranking tests to use the phrase booster.
Many a time I have seen a slow search without tracing turned on. I then
go to visit the net/trace page to see what happened but it is missing
the stats to try work out why.
This commit will ensure we always log the aggregated statistics. This
should be cheap to do given Add is relatively fast and LazyPrintf will
only do the stringer operation if someone loads the debug page. It does
mean a Stats object lives a bit longer in memory, but it is small.
Test Plan: go test
When thinking about transforming queries like 'foo bar' into '(foo bar)
or "foo bar"' we would want to keep the phrase candidateMatch and not
throw it away in gatherMatches. By sorting longer matches before others
that start at the same offset we end up keeping those.
Note: this only affects ChunkMatch, since for LineMatch we merge when we
find overlaps.
Test Plan: This was quite hard to test with our existing e2e tests due
to them not recording offsets, only matching lines. So instead I am just
relying on the fact we didn't break anything and once we add proper
support for phrases we will have a test then.
This is a refactor which removes our duplicated scoring logic for
ChunkMatch vs LineMatch. Instead we score slices of []*candidateMatch.
Other than being a good refactor, candidateMatch is a much more
appropriate structure to stuff in extra information for scoring than our
public APIs. So this enables the work we want to do around atom based
scoring.
The only behaviour change in this commit are two fixes:
- DocumentSection caching would fail if empty since we relied on the
empty cache to be non-nil. This lead to inflated ContentBytesLoaded.
- Empty FileMatch would still read in DocumentSection, even though it
wasn't needed.
Test Plan: go test. Our coverage is decent, we have lots of ranking
tests which did not change and we have hardcoded stats of work done
which did not change (except for the above fixes).
There are several fields which are optional in FileMatch. For example
branch and version are only set if indexing a git repo (zoekt can index
arbitrary files). We mark these fields as optional to marshal in JSON.
The main benefit of this is improving the readability of some golden
files.
Note: I noticed this when an unrelated commit touched these golden
files.
Test Plan: go test
I am often reading the output of String in traces and logs, and it is
really hard to parse since there are many fields and most are unset.
This is a quality of life improvement so it is much easier to scan the
output.
For example the default zoekt-webserver struct's string output goes from
a 456 byte string to
zoekt.SearchOptions{ ShardMaxMatchCount=100000 TotalMaxMatchCount=1000000 MaxWallTime=10s }
Test Plan: go test. The unit tests ensure I cover every field now and in
the future when fields are added.
This adds four cases which are exact phrases to search for.
"assets are not configured for this binary" finds the correct document,
but it isn't shown in the summary. Our target rank doesn't capture that
this could be better, but I will use the golden file to see if I improve
this.
"sourcegraph/server docker image build" is an example of an exact phrase
which also happens to contain words of highly ranked symbols. This leads
to the exact phrase getting buried. I want to see if I can boost that.
"bufio flush writer" should find the symbol bufioFlushWriter.
"coverage data writer" should find the symbol CoverageDataWriter.
Test Plan: go test
This adds a new field to the golden files "targetRank" which records the
rank of the document we are looking for. Additionally the document is
marked with "**" in the golden files.
Additionally we add a new golden file which contains recall@1, recall@5
and the MRR.
I set the target documents by looking at the existing results and
guessing which was the one we wanted based on memory. In some cases we
no longer had the top document, for example for generate unit test.
Test Plan: go test
While iterating on the ranking tests it is frustrating to wait for the
index to be built. This commit adds a "-shard_cache" flag to reuse the
computed shards. It defaults to off since this could lead to unexpected
results if we change how we index.
Test Plan: run go test several times with the flag, only the first run
should be slow. Then confirm without flag it is "slow".
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.307s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.235s
$ go test -shard_cache
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 0.218s
$ go test
PASS
ok github.com/sourcegraph/zoekt/internal/e2e 23.416s
We had ranking e2e tests living in the zoekt-archive-index cmd for
convenience since that contained useful functions for indexing a remote
tarball from the GitHub API. This commit splits the archive
functionality into a new internal/archive package and the ranking tests
into a new internal/e2e package.
The zoekt-archive-index code is now quite minimal. This is similiar to
how zoekt-git-index mostly just calls out to the gitindex package. What
is different is that archive package is marked internal, unlike
gitindex. gitindex should also be internal, but the code predates go's
support for internal.
I suspect more of our e2e tests will end up in this package.
Test Plan: go test ./...
This change uses the fact that candidate matches should be increasing in byte
offset, to avoid recounting runes on a line. Before this change if you have
many matches on the same line we would call `utf8.RuneCount` for each match,
which is a `O(nm)` algorithm where `n` is your line length and `m` is the
number of matches. After this change the complexity is `O(n)`.
I came across this while investigating slow performance for searching the
string "dev" on s2 taking 2s if the match limits where 100k instead of 10k.
With 10k it would take 0.04s. It turns out with the larger limit we ended up
searching a file were the word dev appeared many times on one line. Running a
profiler against the service came up with 96% of CPU time in `utf8.RuneCount`.
This commit adds a benchmark for the helper introduced to reuse RuneCounts.
Unsurprisingly the difference is massive between `O(nm)` and `O(n)` :)
name old time/op new time/op delta
ColumnHelper-32 299ms ± 2% 0ms ± 2% -99.97% (p=0.000 n=10+10)
Test Plan: Added tests and benchmarks.
I'd like to start taking advantage of the NumContextLines option, but have been running into an issue where, if context lines were to extend past the end of the file, we get the trailing newline at the end of the file.
This is not desirable because the empty slice after a trailing newline is not treated as a "line" by any editor, so when we split on newlines, an empty line is shown to the user. By the time we're splitting on newlines, we do not know whether or not we're at the end of the file, so we cannot know whether we can trim that trailing newline, or whether it is, correctly, an empty line that should be rendered.
This is really annoying stuff that bites us elsewhere as well (searcher, syntax highlighter). It might be nice to unify the definitions of "what is a line" in some library, but for now, I'll be happy with "don't return an extra line."
We have a suspicion that when we switched to using the mmap-go library
it contributed to an issue where zoekt-webserver stops responding on
some linux versions. Our suspicion has something to do from us
hardcoding the page size to 4k to asking the kernel and how that
interacts with THP. Given we don't actually build for windows anymore,
we are partially reverting the mmap changes from
7424ac84bab3ec9ae247339909ffd84e5e3b1338
Test Plan: go test ./... on darwin. Then CI for linux testing.
Currently, we use a single ctags process for indexing an entire repository.
Even though we build shards in parallel, they all share the same (single
threaded) ctags process. Since ctags is one of the most expensive parts of
shard building, this creates a bottleneck that can really slow down indexing.
This change proposes to launch a new ctags process per shard. For
`sgtest/megarepo`, this speeds up indexing by almost 2x (enabling scip-ctags
and setting `-parallelism=4`):
* Before: took 4 min 48 sec to index repo
* After: took 2 min 30 sec to index repo
This change allows the shard concurrency to be set through index options. This
is much more convenient than the current way to limit CPU through the server
flag `-cpu_fraction`.
In a follow-up we'll expose shard concurrency through Sourcegraph site config,
and pass it through these index options. Maybe we can deprecate and move away
from `-cpu_fraction` in favor of this approach.
This change makes a couple improvements to the doc content checks during indexing. When a large file is explicitly marked as "allowed", we don't enforce the max trigram count. However, we still iterated through all its trigrams and collected them in a map. Now we short-circuit the check to avoid counting all the trigrams.
We now also reuse the trigram map across documents. This makes sense, as it's always presized with the same capacity hint. This doesn't have a significant effect on indexing speed, but significantly reduces allocations
This is an initial framework for having golden file results for search
results against a real repository. At first we have only added one query
and one repository, but it should be straightforward to grow this list
further.
The golden files we write to disk are a summary of results. This matches how
we have been using the zoekt CLI tool on the keyword branch during our ranking
work.
Test Plan: go test
By default, `zoekt-sourcegraph-indexserver` builds one repo at a time. For each
repo, shards are built in parallel using a number of threads equal to available
CPUs. There are two ways to adjust the indexing concurrency:
1. Passing `cpu_fraction`, which limits the available CPUs for parallel shard
building
2. Passing `index_concurrency` (or setting the `SRC_INDEX_CONCURRENCY`
environment variable), to index more than one repo at once
If you set `index_concurrency` to some number greater than 1, then indexing
will use more threads than available CPUs. This seems undesirable, especially
if you set `cpu_fraction`, since you'd expect that to put an upper bound on CPU
usage.
This changes the shard-level parallelism to `available CPUs / index_concurrency`
(rounded down), to bound the CPU usage as expected.
Previously, we didn't require that scip-ctags be available even if if you set
`require_ctags` and set `language_map` to something like `go:scip`. Instead, we
silently fell back to `universal-ctags`. This is tricky and could mask a real
issue where we expect scip-ctags to be available but it isn't.
Now, we check if SCIP is needed based on `language_map`, and if so require that
scip-ctags is available.
From reading the implementation recently I noticed that we should always
go via evalMatchTree to ensure we correctly use the known map for
caching. This updates the two call sites we didn't do this as well as
documenting this requirement.
I don't think this caused any noticeable performance issues. We would do
slightly more work in the case the root matchTree was an and, but it
would have been miniscule.
Test Plan: go test
The pair of bools (matches, sure) often was quite hard to reason about.
I think this came down to them often not being named at return sites,
the names itself being too concise and that two booleans represents 4
possible states, but we only had two possible states.
This commit uses a more verbose enum instead of booleans. From reading
the diff back I find the code easier to reason about, so I think this is
a good change.
Test Plan: go test ./...
universal-ctags sometimes returns lines that are out of bounds. In
practice it seems to only do an off by one. We haven't noticed the
linenum error until a recent change of mine which didn't append an extra
entry to NLS if the file was terminated by "\n". In practice this would
end up being filtered out later on. So we update to just continue rather
than error here.
An example is
https://github.com/sourcegraph/sourcegraph/blob/v5.2.2/client/web-sveltekit/.storybook/main.ts
$ universal-ctags '--fields=*' --output-format=json main.ts | grep 22
{"_type": "tag", "name": "config", "path": "main.ts", "pattern": "/^export default config$/", "language": "TypeScript", "line": 22, "kind": "constant", "roles": "def"}
$ wc -l main.ts
21 main.ts
$ tail -n1 main.ts
export default config
Test Plan: added a unit test
When indexing, we build shards in parallel based on the `parallelism` flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly `100MB * parallelism`.
Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.
This change fixes the throttling logic so we block indexing when all threads
are busy building shards.
This change makes a couple small improvements to how we handle skipped docs:
* Immediately skip ctags parsing if the content is `nil`
* Always sort skipped docs to the end of the shard. This seems like a nice
invariant. And generally it's good for performance to group data that is
expected to be accessed together and has similar content.
When indexing documents, we buffer up documents until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffered the full document. So if there are a large
number of binary files, we could easily blow past the 100MB limit and run into
memory issues.
This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
This was a huge oversight that has lived in our codebase since we
introduced symbolRegexpMatchTree. Because we don't call prepare, we
don't correctly use the index for symbol regex queries. From some local
testing this makes a huge difference to performance.
Huge shout-out to @camdencheek who spotted this.
Test Plan: validated with some local searches that results remain the same and
that the statistics for the searches go up for IndexBytesLoaded, but go down
for ContentBytesLoaded, FilesConsidered, FilesLoaded, etc. Added unit tests
which assert the index is used. Also perf tested with hyperfine.
Hyperfine results:
Benchmark 1: ./zoekt-before -sym '^searcher$'
Time (mean ± σ): 93.0 ms ± 1.2 ms [User: 142.2 ms, System: 18.9 ms]
Range (min … max): 90.8 ms … 95.6 ms 31 runs
Benchmark 2: ./zoekt-after -sym '^searcher$'
Time (mean ± σ): 52.3 ms ± 0.5 ms [User: 76.3 ms, System: 13.0 ms]
Range (min … max): 50.7 ms … 53.4 ms 53 runs
Summary
'./zoekt-after -sym '^searcher$'' ran
1.78 ± 0.03 times faster than './zoekt-before -sym '^searcher$''
For that search, a random comparison of the zoekt stats:
| Stat | Before | After | Delta |
|---------------------- |---------- |--------- |----------- |
| ContentBytesLoaded | 199007382 | 22566033 | -176441349 |
| IndexBytesLoaded | 3527 | 165645 | 162118 |
| Crashes | 0 | 0 | 0 |
| Duration | 57956167 | 17568708 | -40387459 |
| FileCount | 28 | 28 | 0 |
| ShardFilesConsidered | 0 | 0 | 0 |
| FilesConsidered | 28477 | 766 | -27711 |
| FilesLoaded | 28477 | 766 | -27711 |
| FilesSkipped | 0 | 0 | 0 |
| ShardsScanned | 5 | 5 | 0 |
| ShardsSkipped | 0 | 0 | 0 |
| ShardsSkippedFilter | 0 | 0 | 0 |
| MatchCount | 29 | 29 | 0 |
| NgramMatches | 87 | 4407 | 4320 |
| NgramLookups | 644 | 644 | 0 |
| Wait | 5792 | 11500 | 5708 |
| MatchTreeConstruction | 498042 | 515248 | 17206 |
| MatchTreeSearch | 97661875 | 23089418 | -74572457 |
Analysis: An absolutely massive reduction in the number of files we consider.
This means we are actually using the index properly. eg look at
ContentBytesLoaded, Duration, FilesConsidered, FilesLoaded. You can also see
that IndexBytesLoaded has gone up since we now use it properly. This was on a
small corpus so will have huge impact in production.
Note that the random changes Wait, MatchTreeConstruction are random, but the
MatchTreeSearch change is a big deal since that is time spent searching after
analysing a query.
Firstly we use bytes.IndexByte for faster newLinesIndices. On my machine this
reduces wall clock time of BenchmarkTagsToSections by 38%. This is faster
since bytes.IndexByte relies on CPU specific optimizations to find the next
new line (eg uses AVX2 if available).
Secondly we reuse nls slice between calls to tagsToSections. I noticed in the
profiler a nonsignificant chunk in the garbage collector. The slice built by
newLinesIndices is allocated and thrown away for each call to tagsToSections.
This means we can re-use it which this commit implements by introducing a
struct storing the buffer. We now use this buffer per shard of symbols we
analyse.
old time/op new time/op delta
188µs ± 7% 101µs ± 3% -46.10% (p=0.000 n=10+10)
old alloc/op new alloc/op delta
79.3kB ± 0% 36.3kB ± 0% -54.24% (p=0.000 n=9+10)
old allocs/op new allocs/op delta
443 ± 0% 441 ± 0% -0.45% (p=0.000 n=10+10)
Test Plan: go test -bench BenchmarkTagsToSections
This adds a monitor which will report every minute the progress of
symbol analysis. Additionally, if a document is taking too long to
analyse (10s) we report it.
At first this is just reporting via stdlog. However, once we are
comfortable with thresholds around this we can likely also include a way
to kill analysis for a file.
Test Plan: Adjusted monitorReportStatus to 1s then indexed the
sourcegraph repo and inspected the output
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:03:10 attempting to index 14533 total files
2023/11/03 16:03:13 DEBUG: symbol analysis still running for shard statistics: duration=1s symbols=15805 bytes=44288971
2023/11/03 16:03:14 DEBUG: symbol analysis still running for shard statistics: duration=2s symbols=26189 bytes=51564417
2023/11/03 16:03:15 DEBUG: symbol analysis still running for shard statistics: duration=3s symbols=55613 bytes=64748084
2023/11/03 16:03:16 DEBUG: symbol analysis still running for shard statistics: duration=4s symbols=86557 bytes=93771404
2023/11/03 16:03:17 DEBUG: symbol analysis still running for shard statistics: duration=5s symbols=125352 bytes=116319453
2023/11/03 16:03:18 symbol analysis finished for shard statistics: duration=5s symbols=142951 bytes=129180023
2023/11/03 16:03:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983298 index bytes (overhead 2.8), 14533 files processed
I then added a random sleep for a minute in a file to see the stuck
reporting:
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:14:57 attempting to index 14533 total files
2023/11/03 16:15:15 WARN: symbol analysis for README.md (3485 bytes) has been running for 14s
2023/11/03 16:15:25 WARN: symbol analysis for README.md (3485 bytes) has been running for 24s
2023/11/03 16:15:45 WARN: symbol analysis for README.md (3485 bytes) has been running for 44s
2023/11/03 16:16:00 DEBUG: symbol analysis still running for shard statistics: duration=1m0s symbols=958 bytes=624329
2023/11/03 16:16:00 symbol analysis for README.md (size 3485 bytes) is done and found 4 symbols
2023/11/03 16:16:06 symbol analysis finished for shard statistics: duration=1m5s symbols=142951 bytes=129180023
2023/11/03 16:16:10 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983299 index bytes (overhead 2.8), 14533 files processed
This change refactors our end-to-end scoring tests and enables local testing
using the scip-ctags binary:
* Split scoring tests out of `e2e_test` and into their own file `scoring_test`
* Split huge test methods into targeted ones like `TestFileNameMatch`,
`TestJava`, `TestGo`, etc.
* For languages that scip-ctags supports, rerun the same cases using the
scip-ctags binary
To run scip-ctags tests locally, you can set the env variable
```
SCIP_CTAGS_COMMAND=<sourcegraph-repo>/dev/scip-ctags-dev
```
This doesn't yet update Zoekt CI to run scip-ctags tests. That will be tackled
in a follow-up.
On large repos, indexing might take quite a while and hit the indexing timeout.
This change helps debug these situations:
* Make the indexing timeout configurable through an env variable
`INDEXING_TIMEOUT`
* Add more info to progress logging: log the total number of files being
indexed, plus the file count per shard
SCIP ctags can output different kind names than universal-ctags (for example
`typeAlias` instead of `talias`). This change makes sure we handle different
names for the same kind. To do so, it refactors the logic so we first match
strings to standard kinds, then decide how these are scored for each language.
That way, you don't need to remember to cover all the possible kind names each
time you adjust scoring for a new language.
Also added basic tests for Ruby and Python to ensure we don't accidentally
change the scoring.
Right now our symbol analyser doesn't tell us if a symbol is exported. We add
a go specific tweak here to boost those results. Ideally this could be
something that is encoded in the symbol information.
Additionally we do downrank _test.go files via the doc-order. But in the case
of symbol matches the boosting overweighs doc order signficantly. I found the
extra downraking quite useful when experimenting.
Test Plan: lots of manual testing on the keyword branch
We bump go-ctags to include markdown files in symbol analysis. We
already have ranking behaviour for markdown files but due to go-ctags it
won't be taken into account (until this commit).
Additionally we bump the version included in the docker container to
v6.0.0.
Test Plan: CI and "docker build -t zoekt ."