Commits
This is an initial framework for having golden file results for search
results against a real repository. At first we have only added one query
and one repository, but it should be straightforward to grow this list
further.
The golden files we write to disk are a summary of results. This matches how
we have been using the zoekt CLI tool on the keyword branch during our ranking
work.
Test Plan: go test
By default, `zoekt-sourcegraph-indexserver` builds one repo at a time. For each
repo, shards are built in parallel using a number of threads equal to available
CPUs. There are two ways to adjust the indexing concurrency:
1. Passing `cpu_fraction`, which limits the available CPUs for parallel shard
building
2. Passing `index_concurrency` (or setting the `SRC_INDEX_CONCURRENCY`
environment variable), to index more than one repo at once
If you set `index_concurrency` to some number greater than 1, then indexing
will use more threads than available CPUs. This seems undesirable, especially
if you set `cpu_fraction`, since you'd expect that to put an upper bound on CPU
usage.
This changes the shard-level parallelism to `available CPUs / index_concurrency`
(rounded down), to bound the CPU usage as expected.
Previously, we didn't require that scip-ctags be available even if if you set
`require_ctags` and set `language_map` to something like `go:scip`. Instead, we
silently fell back to `universal-ctags`. This is tricky and could mask a real
issue where we expect scip-ctags to be available but it isn't.
Now, we check if SCIP is needed based on `language_map`, and if so require that
scip-ctags is available.
From reading the implementation recently I noticed that we should always
go via evalMatchTree to ensure we correctly use the known map for
caching. This updates the two call sites we didn't do this as well as
documenting this requirement.
I don't think this caused any noticeable performance issues. We would do
slightly more work in the case the root matchTree was an and, but it
would have been miniscule.
Test Plan: go test
The pair of bools (matches, sure) often was quite hard to reason about.
I think this came down to them often not being named at return sites,
the names itself being too concise and that two booleans represents 4
possible states, but we only had two possible states.
This commit uses a more verbose enum instead of booleans. From reading
the diff back I find the code easier to reason about, so I think this is
a good change.
Test Plan: go test ./...
I experimented with some changes to encourage `go-git` to use less memory. They
didn't pan out, but this intermediate refactor felt useful on its own. It helps
break up the super long `indexGitRepo` method.
universal-ctags sometimes returns lines that are out of bounds. In
practice it seems to only do an off by one. We haven't noticed the
linenum error until a recent change of mine which didn't append an extra
entry to NLS if the file was terminated by "\n". In practice this would
end up being filtered out later on. So we update to just continue rather
than error here.
An example is
https://github.com/sourcegraph/sourcegraph/blob/v5.2.2/client/web-sveltekit/.storybook/main.ts
$ universal-ctags '--fields=*' --output-format=json main.ts | grep 22
{"_type": "tag", "name": "config", "path": "main.ts", "pattern": "/^export default config$/", "language": "TypeScript", "line": 22, "kind": "constant", "roles": "def"}
$ wc -l main.ts
21 main.ts
$ tail -n1 main.ts
export default config
Test Plan: added a unit test
This PR makes the httpClient an interface with the minimal surface area required by the stream.Client.
This lets us inject a custom doer more easily, which can be beneficial for testing, and in the case of Sourcegraph, lets us augment the Doer with middlewares.
When indexing, we build shards in parallel based on the `parallelism` flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly `100MB * parallelism`.
Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.
This change fixes the throttling logic so we block indexing when all threads
are busy building shards.
This change makes a couple small improvements to how we handle skipped docs:
* Immediately skip ctags parsing if the content is `nil`
* Always sort skipped docs to the end of the shard. This seems like a nice
invariant. And generally it's good for performance to group data that is
expected to be accessed together and has similar content.
When indexing documents, we buffer up documents until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffered the full document. So if there are a large
number of binary files, we could easily blow past the 100MB limit and run into
memory issues.
This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
This was a huge oversight that has lived in our codebase since we
introduced symbolRegexpMatchTree. Because we don't call prepare, we
don't correctly use the index for symbol regex queries. From some local
testing this makes a huge difference to performance.
Huge shout-out to @camdencheek who spotted this.
Test Plan: validated with some local searches that results remain the same and
that the statistics for the searches go up for IndexBytesLoaded, but go down
for ContentBytesLoaded, FilesConsidered, FilesLoaded, etc. Added unit tests
which assert the index is used. Also perf tested with hyperfine.
Hyperfine results:
Benchmark 1: ./zoekt-before -sym '^searcher$'
Time (mean ± σ): 93.0 ms ± 1.2 ms [User: 142.2 ms, System: 18.9 ms]
Range (min … max): 90.8 ms … 95.6 ms 31 runs
Benchmark 2: ./zoekt-after -sym '^searcher$'
Time (mean ± σ): 52.3 ms ± 0.5 ms [User: 76.3 ms, System: 13.0 ms]
Range (min … max): 50.7 ms … 53.4 ms 53 runs
Summary
'./zoekt-after -sym '^searcher$'' ran
1.78 ± 0.03 times faster than './zoekt-before -sym '^searcher$''
For that search, a random comparison of the zoekt stats:
| Stat | Before | After | Delta |
|---------------------- |---------- |--------- |----------- |
| ContentBytesLoaded | 199007382 | 22566033 | -176441349 |
| IndexBytesLoaded | 3527 | 165645 | 162118 |
| Crashes | 0 | 0 | 0 |
| Duration | 57956167 | 17568708 | -40387459 |
| FileCount | 28 | 28 | 0 |
| ShardFilesConsidered | 0 | 0 | 0 |
| FilesConsidered | 28477 | 766 | -27711 |
| FilesLoaded | 28477 | 766 | -27711 |
| FilesSkipped | 0 | 0 | 0 |
| ShardsScanned | 5 | 5 | 0 |
| ShardsSkipped | 0 | 0 | 0 |
| ShardsSkippedFilter | 0 | 0 | 0 |
| MatchCount | 29 | 29 | 0 |
| NgramMatches | 87 | 4407 | 4320 |
| NgramLookups | 644 | 644 | 0 |
| Wait | 5792 | 11500 | 5708 |
| MatchTreeConstruction | 498042 | 515248 | 17206 |
| MatchTreeSearch | 97661875 | 23089418 | -74572457 |
Analysis: An absolutely massive reduction in the number of files we consider.
This means we are actually using the index properly. eg look at
ContentBytesLoaded, Duration, FilesConsidered, FilesLoaded. You can also see
that IndexBytesLoaded has gone up since we now use it properly. This was on a
small corpus so will have huge impact in production.
Note that the random changes Wait, MatchTreeConstruction are random, but the
MatchTreeSearch change is a big deal since that is time spent searching after
analysing a query.
Firstly we use bytes.IndexByte for faster newLinesIndices. On my machine this
reduces wall clock time of BenchmarkTagsToSections by 38%. This is faster
since bytes.IndexByte relies on CPU specific optimizations to find the next
new line (eg uses AVX2 if available).
Secondly we reuse nls slice between calls to tagsToSections. I noticed in the
profiler a nonsignificant chunk in the garbage collector. The slice built by
newLinesIndices is allocated and thrown away for each call to tagsToSections.
This means we can re-use it which this commit implements by introducing a
struct storing the buffer. We now use this buffer per shard of symbols we
analyse.
old time/op new time/op delta
188µs ± 7% 101µs ± 3% -46.10% (p=0.000 n=10+10)
old alloc/op new alloc/op delta
79.3kB ± 0% 36.3kB ± 0% -54.24% (p=0.000 n=9+10)
old allocs/op new allocs/op delta
443 ± 0% 441 ± 0% -0.45% (p=0.000 n=10+10)
Test Plan: go test -bench BenchmarkTagsToSections
I found the code a bit hard to read before, this I believe is more
clear. I was also hoping for an improvement in the benchmarks, but the
improvement was statistically insignificant.
Test Plan: go test
This change adds a benchmark for the conversion from ctags output to Zoekt
document data, plus a tiny optimization to presize the symbol slices.
This adds a monitor which will report every minute the progress of
symbol analysis. Additionally, if a document is taking too long to
analyse (10s) we report it.
At first this is just reporting via stdlog. However, once we are
comfortable with thresholds around this we can likely also include a way
to kill analysis for a file.
Test Plan: Adjusted monitorReportStatus to 1s then indexed the
sourcegraph repo and inspected the output
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:03:10 attempting to index 14533 total files
2023/11/03 16:03:13 DEBUG: symbol analysis still running for shard statistics: duration=1s symbols=15805 bytes=44288971
2023/11/03 16:03:14 DEBUG: symbol analysis still running for shard statistics: duration=2s symbols=26189 bytes=51564417
2023/11/03 16:03:15 DEBUG: symbol analysis still running for shard statistics: duration=3s symbols=55613 bytes=64748084
2023/11/03 16:03:16 DEBUG: symbol analysis still running for shard statistics: duration=4s symbols=86557 bytes=93771404
2023/11/03 16:03:17 DEBUG: symbol analysis still running for shard statistics: duration=5s symbols=125352 bytes=116319453
2023/11/03 16:03:18 symbol analysis finished for shard statistics: duration=5s symbols=142951 bytes=129180023
2023/11/03 16:03:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983298 index bytes (overhead 2.8), 14533 files processed
I then added a random sleep for a minute in a file to see the stuck
reporting:
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:14:57 attempting to index 14533 total files
2023/11/03 16:15:15 WARN: symbol analysis for README.md (3485 bytes) has been running for 14s
2023/11/03 16:15:25 WARN: symbol analysis for README.md (3485 bytes) has been running for 24s
2023/11/03 16:15:45 WARN: symbol analysis for README.md (3485 bytes) has been running for 44s
2023/11/03 16:16:00 DEBUG: symbol analysis still running for shard statistics: duration=1m0s symbols=958 bytes=624329
2023/11/03 16:16:00 symbol analysis for README.md (size 3485 bytes) is done and found 4 symbols
2023/11/03 16:16:06 symbol analysis finished for shard statistics: duration=1m5s symbols=142951 bytes=129180023
2023/11/03 16:16:10 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983299 index bytes (overhead 2.8), 14533 files processed
This change refactors our end-to-end scoring tests and enables local testing
using the scip-ctags binary:
* Split scoring tests out of `e2e_test` and into their own file `scoring_test`
* Split huge test methods into targeted ones like `TestFileNameMatch`,
`TestJava`, `TestGo`, etc.
* For languages that scip-ctags supports, rerun the same cases using the
scip-ctags binary
To run scip-ctags tests locally, you can set the env variable
```
SCIP_CTAGS_COMMAND=<sourcegraph-repo>/dev/scip-ctags-dev
```
This doesn't yet update Zoekt CI to run scip-ctags tests. That will be tackled
in a follow-up.
On large repos, indexing might take quite a while and hit the indexing timeout.
This change helps debug these situations:
* Make the indexing timeout configurable through an env variable
`INDEXING_TIMEOUT`
* Add more info to progress logging: log the total number of files being
indexed, plus the file count per shard
SCIP ctags can output different kind names than universal-ctags (for example
`typeAlias` instead of `talias`). This change makes sure we handle different
names for the same kind. To do so, it refactors the logic so we first match
strings to standard kinds, then decide how these are scored for each language.
That way, you don't need to remember to cover all the possible kind names each
time you adjust scoring for a new language.
Also added basic tests for Ruby and Python to ensure we don't accidentally
change the scoring.
Right now our symbol analyser doesn't tell us if a symbol is exported. We add
a go specific tweak here to boost those results. Ideally this could be
something that is encoded in the symbol information.
Additionally we do downrank _test.go files via the doc-order. But in the case
of symbol matches the boosting overweighs doc order signficantly. I found the
extra downraking quite useful when experimenting.
Test Plan: lots of manual testing on the keyword branch
Right now we boost a file extension that hasn't been seen to the 3rd
position. This is gated by an environment variable which defaults to
on. I want to explore if there are ways we can turn on this behaviour
with the query language.
Test Plan: go run ./cmd/zoekt foo
Tiny change to use a type def, as we usually prefer those over type aliases.
We bump go-ctags to include markdown files in symbol analysis. We
already have ranking behaviour for markdown files but due to go-ctags it
won't be taken into account (until this commit).
Additionally we bump the version included in the docker container to
v6.0.0.
Test Plan: CI and "docker build -t zoekt ."
We add a bit more complexity to addScore but avoid allocating a string if debugScore = false.
Test plan:
I ran a couple of benchmarks and confirmed we don't allocate.
Relates to #667.
This updates the expected scores, now that we have removed the repetition boost.
Tiny change to display the atom count. I caught myself several times trying to
translate the score to an atom count.
Test plan:
Ran it locally
I don't remember if there was a good reason to score methods and
functions differently, but I have found examples where the difference
between the two scores causes the better result to be lower ranked,
although it has better file-order boost.
Scip-ctags distinguishes between methods and functions, universal-ctags
does not.
Test plan:
updated score tests
The definition of how this is applied is very narrow and more often than
not works poorly. Originally I sent out a commit to dampen this using
log, but Julie suggested just removing which sounds better.
Test Plan: go test
Because "type" in scip-ctags matches so much we can't boost it as high
as we do for classes/structs in our language specific boosting. In the
case of GraphQL this is useful to boost.
Test Plan: A search like "type User" in the sourcegraph codebase now
includes results from the schema.graphql files.
Very useful for quick iteration on ranking work.
Test Plan: go run ./cmd/zoekt -debug foo
* temporarily use log branch
* logging: remove description param in Scoped
* use latest log and mountinfo versions
There are likely more that we are missing. We only briefly looked at
golang, likely other languages are also affected. The regression is
scip-ctags returning different values for kind.
Will follow-up with a more comprehensive workaround until scip-ctags
issues are resolved.
Test Plan: manually tested queries like "test server" against the golang
repo until the test server struct was the top result.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This field is no longer requested in Sourcegraph and has been deprecated
for a while.
Test Plan: go test ./...
Previously we enforced the binary was called universal-ctags. However,
we let users override the name of the binary, so if they override it we
should use it. To prevent footguns, we now validate the binary was built
with the interactive feature.
In the case of scip-ctags we use the old validation of just checking the
name.
Additionally we fix a bug that was introduced where if symbols are
optional we continue if parsing fails.
Test Plan: indexed with and without universal-ctags. Ctags on my mbp
points to something from xcode which wouldn't work
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:12 indexGitRepo(/Users/keegan/src/github.com/sourcegraph/zoekt, delta=false): build.NewBuilder: ctags.NewParserMap: ctags binary is not universal-ctags or is not compiled with +interactive feature: bin=ctags
exit status 1
$ CTAGS_COMMAND=universal-ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:29 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8657338 index bytes (overhead 2.9)
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index .
2023/10/04 17:08:40 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8538246 index bytes (overhead 2.9)
Right now we use panicf which leads a stack trace which is misleading at
what is happening and fills up the space used by kubernetes error
reporting. Additionally a few times we have had bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error.
Then ran the following
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3
This probably caused our tmp-dirs to run full. In production we saw data
from years ago taking up disk space.
Co-authored-by: Petri-Johan Last <petri.last@sourcegraph.com>
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
This was a nice to have when we investigated search times. A lot of the
stats we have don't really capture slow downs due to IO/etc, so a rough
time based metric will be useful to understand how much time we spend
actually searching vs deciding if we should search.
Test Plan: go test ./...
If the doc limit was triggered, the match limit was never evaluated,
even if it could have further reduced the FileMatches.
This moves all decisions about if we should do display limits and how to
enforce them into one place. By default we make it stateful so it can be
used by the limitSender. But in every other place we will always sort
and then truncate, so we also introduce a convenience API for that.
The only potential behaviour change here is if you have no display
limits but had document ranks turned on, we would not sort matches
within a shard. However, the aggregator would still do sorting anyways
so this is really just a wasted sort. In practice a display limit should
always be set.
Test Plan: go test
This is an initial framework for having golden file results for search
results against a real repository. At first we have only added one query
and one repository, but it should be straightforward to grow this list
further.
The golden files we write to disk are a summary of results. This matches how
we have been using the zoekt CLI tool on the keyword branch during our ranking
work.
Test Plan: go test
By default, `zoekt-sourcegraph-indexserver` builds one repo at a time. For each
repo, shards are built in parallel using a number of threads equal to available
CPUs. There are two ways to adjust the indexing concurrency:
1. Passing `cpu_fraction`, which limits the available CPUs for parallel shard
building
2. Passing `index_concurrency` (or setting the `SRC_INDEX_CONCURRENCY`
environment variable), to index more than one repo at once
If you set `index_concurrency` to some number greater than 1, then indexing
will use more threads than available CPUs. This seems undesirable, especially
if you set `cpu_fraction`, since you'd expect that to put an upper bound on CPU
usage.
This changes the shard-level parallelism to `available CPUs / index_concurrency`
(rounded down), to bound the CPU usage as expected.
Previously, we didn't require that scip-ctags be available even if if you set
`require_ctags` and set `language_map` to something like `go:scip`. Instead, we
silently fell back to `universal-ctags`. This is tricky and could mask a real
issue where we expect scip-ctags to be available but it isn't.
Now, we check if SCIP is needed based on `language_map`, and if so require that
scip-ctags is available.
From reading the implementation recently I noticed that we should always
go via evalMatchTree to ensure we correctly use the known map for
caching. This updates the two call sites we didn't do this as well as
documenting this requirement.
I don't think this caused any noticeable performance issues. We would do
slightly more work in the case the root matchTree was an and, but it
would have been miniscule.
Test Plan: go test
The pair of bools (matches, sure) often was quite hard to reason about.
I think this came down to them often not being named at return sites,
the names itself being too concise and that two booleans represents 4
possible states, but we only had two possible states.
This commit uses a more verbose enum instead of booleans. From reading
the diff back I find the code easier to reason about, so I think this is
a good change.
Test Plan: go test ./...
universal-ctags sometimes returns lines that are out of bounds. In
practice it seems to only do an off by one. We haven't noticed the
linenum error until a recent change of mine which didn't append an extra
entry to NLS if the file was terminated by "\n". In practice this would
end up being filtered out later on. So we update to just continue rather
than error here.
An example is
https://github.com/sourcegraph/sourcegraph/blob/v5.2.2/client/web-sveltekit/.storybook/main.ts
$ universal-ctags '--fields=*' --output-format=json main.ts | grep 22
{"_type": "tag", "name": "config", "path": "main.ts", "pattern": "/^export default config$/", "language": "TypeScript", "line": 22, "kind": "constant", "roles": "def"}
$ wc -l main.ts
21 main.ts
$ tail -n1 main.ts
export default config
Test Plan: added a unit test
When indexing, we build shards in parallel based on the `parallelism` flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly `100MB * parallelism`.
Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.
This change fixes the throttling logic so we block indexing when all threads
are busy building shards.
This change makes a couple small improvements to how we handle skipped docs:
* Immediately skip ctags parsing if the content is `nil`
* Always sort skipped docs to the end of the shard. This seems like a nice
invariant. And generally it's good for performance to group data that is
expected to be accessed together and has similar content.
When indexing documents, we buffer up documents until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffered the full document. So if there are a large
number of binary files, we could easily blow past the 100MB limit and run into
memory issues.
This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
This was a huge oversight that has lived in our codebase since we
introduced symbolRegexpMatchTree. Because we don't call prepare, we
don't correctly use the index for symbol regex queries. From some local
testing this makes a huge difference to performance.
Huge shout-out to @camdencheek who spotted this.
Test Plan: validated with some local searches that results remain the same and
that the statistics for the searches go up for IndexBytesLoaded, but go down
for ContentBytesLoaded, FilesConsidered, FilesLoaded, etc. Added unit tests
which assert the index is used. Also perf tested with hyperfine.
Hyperfine results:
Benchmark 1: ./zoekt-before -sym '^searcher$'
Time (mean ± σ): 93.0 ms ± 1.2 ms [User: 142.2 ms, System: 18.9 ms]
Range (min … max): 90.8 ms … 95.6 ms 31 runs
Benchmark 2: ./zoekt-after -sym '^searcher$'
Time (mean ± σ): 52.3 ms ± 0.5 ms [User: 76.3 ms, System: 13.0 ms]
Range (min … max): 50.7 ms … 53.4 ms 53 runs
Summary
'./zoekt-after -sym '^searcher$'' ran
1.78 ± 0.03 times faster than './zoekt-before -sym '^searcher$''
For that search, a random comparison of the zoekt stats:
| Stat | Before | After | Delta |
|---------------------- |---------- |--------- |----------- |
| ContentBytesLoaded | 199007382 | 22566033 | -176441349 |
| IndexBytesLoaded | 3527 | 165645 | 162118 |
| Crashes | 0 | 0 | 0 |
| Duration | 57956167 | 17568708 | -40387459 |
| FileCount | 28 | 28 | 0 |
| ShardFilesConsidered | 0 | 0 | 0 |
| FilesConsidered | 28477 | 766 | -27711 |
| FilesLoaded | 28477 | 766 | -27711 |
| FilesSkipped | 0 | 0 | 0 |
| ShardsScanned | 5 | 5 | 0 |
| ShardsSkipped | 0 | 0 | 0 |
| ShardsSkippedFilter | 0 | 0 | 0 |
| MatchCount | 29 | 29 | 0 |
| NgramMatches | 87 | 4407 | 4320 |
| NgramLookups | 644 | 644 | 0 |
| Wait | 5792 | 11500 | 5708 |
| MatchTreeConstruction | 498042 | 515248 | 17206 |
| MatchTreeSearch | 97661875 | 23089418 | -74572457 |
Analysis: An absolutely massive reduction in the number of files we consider.
This means we are actually using the index properly. eg look at
ContentBytesLoaded, Duration, FilesConsidered, FilesLoaded. You can also see
that IndexBytesLoaded has gone up since we now use it properly. This was on a
small corpus so will have huge impact in production.
Note that the random changes Wait, MatchTreeConstruction are random, but the
MatchTreeSearch change is a big deal since that is time spent searching after
analysing a query.
Firstly we use bytes.IndexByte for faster newLinesIndices. On my machine this
reduces wall clock time of BenchmarkTagsToSections by 38%. This is faster
since bytes.IndexByte relies on CPU specific optimizations to find the next
new line (eg uses AVX2 if available).
Secondly we reuse nls slice between calls to tagsToSections. I noticed in the
profiler a nonsignificant chunk in the garbage collector. The slice built by
newLinesIndices is allocated and thrown away for each call to tagsToSections.
This means we can re-use it which this commit implements by introducing a
struct storing the buffer. We now use this buffer per shard of symbols we
analyse.
old time/op new time/op delta
188µs ± 7% 101µs ± 3% -46.10% (p=0.000 n=10+10)
old alloc/op new alloc/op delta
79.3kB ± 0% 36.3kB ± 0% -54.24% (p=0.000 n=9+10)
old allocs/op new allocs/op delta
443 ± 0% 441 ± 0% -0.45% (p=0.000 n=10+10)
Test Plan: go test -bench BenchmarkTagsToSections
This adds a monitor which will report every minute the progress of
symbol analysis. Additionally, if a document is taking too long to
analyse (10s) we report it.
At first this is just reporting via stdlog. However, once we are
comfortable with thresholds around this we can likely also include a way
to kill analysis for a file.
Test Plan: Adjusted monitorReportStatus to 1s then indexed the
sourcegraph repo and inspected the output
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:03:10 attempting to index 14533 total files
2023/11/03 16:03:13 DEBUG: symbol analysis still running for shard statistics: duration=1s symbols=15805 bytes=44288971
2023/11/03 16:03:14 DEBUG: symbol analysis still running for shard statistics: duration=2s symbols=26189 bytes=51564417
2023/11/03 16:03:15 DEBUG: symbol analysis still running for shard statistics: duration=3s symbols=55613 bytes=64748084
2023/11/03 16:03:16 DEBUG: symbol analysis still running for shard statistics: duration=4s symbols=86557 bytes=93771404
2023/11/03 16:03:17 DEBUG: symbol analysis still running for shard statistics: duration=5s symbols=125352 bytes=116319453
2023/11/03 16:03:18 symbol analysis finished for shard statistics: duration=5s symbols=142951 bytes=129180023
2023/11/03 16:03:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983298 index bytes (overhead 2.8), 14533 files processed
I then added a random sleep for a minute in a file to see the stuck
reporting:
$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:14:57 attempting to index 14533 total files
2023/11/03 16:15:15 WARN: symbol analysis for README.md (3485 bytes) has been running for 14s
2023/11/03 16:15:25 WARN: symbol analysis for README.md (3485 bytes) has been running for 24s
2023/11/03 16:15:45 WARN: symbol analysis for README.md (3485 bytes) has been running for 44s
2023/11/03 16:16:00 DEBUG: symbol analysis still running for shard statistics: duration=1m0s symbols=958 bytes=624329
2023/11/03 16:16:00 symbol analysis for README.md (size 3485 bytes) is done and found 4 symbols
2023/11/03 16:16:06 symbol analysis finished for shard statistics: duration=1m5s symbols=142951 bytes=129180023
2023/11/03 16:16:10 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983299 index bytes (overhead 2.8), 14533 files processed
This change refactors our end-to-end scoring tests and enables local testing
using the scip-ctags binary:
* Split scoring tests out of `e2e_test` and into their own file `scoring_test`
* Split huge test methods into targeted ones like `TestFileNameMatch`,
`TestJava`, `TestGo`, etc.
* For languages that scip-ctags supports, rerun the same cases using the
scip-ctags binary
To run scip-ctags tests locally, you can set the env variable
```
SCIP_CTAGS_COMMAND=<sourcegraph-repo>/dev/scip-ctags-dev
```
This doesn't yet update Zoekt CI to run scip-ctags tests. That will be tackled
in a follow-up.
On large repos, indexing might take quite a while and hit the indexing timeout.
This change helps debug these situations:
* Make the indexing timeout configurable through an env variable
`INDEXING_TIMEOUT`
* Add more info to progress logging: log the total number of files being
indexed, plus the file count per shard
SCIP ctags can output different kind names than universal-ctags (for example
`typeAlias` instead of `talias`). This change makes sure we handle different
names for the same kind. To do so, it refactors the logic so we first match
strings to standard kinds, then decide how these are scored for each language.
That way, you don't need to remember to cover all the possible kind names each
time you adjust scoring for a new language.
Also added basic tests for Ruby and Python to ensure we don't accidentally
change the scoring.
Right now our symbol analyser doesn't tell us if a symbol is exported. We add
a go specific tweak here to boost those results. Ideally this could be
something that is encoded in the symbol information.
Additionally we do downrank _test.go files via the doc-order. But in the case
of symbol matches the boosting overweighs doc order signficantly. I found the
extra downraking quite useful when experimenting.
Test Plan: lots of manual testing on the keyword branch
We bump go-ctags to include markdown files in symbol analysis. We
already have ranking behaviour for markdown files but due to go-ctags it
won't be taken into account (until this commit).
Additionally we bump the version included in the docker container to
v6.0.0.
Test Plan: CI and "docker build -t zoekt ."
I don't remember if there was a good reason to score methods and
functions differently, but I have found examples where the difference
between the two scores causes the better result to be lower ranked,
although it has better file-order boost.
Scip-ctags distinguishes between methods and functions, universal-ctags
does not.
Test plan:
updated score tests
There are likely more that we are missing. We only briefly looked at
golang, likely other languages are also affected. The regression is
scip-ctags returning different values for kind.
Will follow-up with a more comprehensive workaround until scip-ctags
issues are resolved.
Test Plan: manually tested queries like "test server" against the golang
repo until the test server struct was the top result.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Previously we enforced the binary was called universal-ctags. However,
we let users override the name of the binary, so if they override it we
should use it. To prevent footguns, we now validate the binary was built
with the interactive feature.
In the case of scip-ctags we use the old validation of just checking the
name.
Additionally we fix a bug that was introduced where if symbols are
optional we continue if parsing fails.
Test Plan: indexed with and without universal-ctags. Ctags on my mbp
points to something from xcode which wouldn't work
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:12 indexGitRepo(/Users/keegan/src/github.com/sourcegraph/zoekt, delta=false): build.NewBuilder: ctags.NewParserMap: ctags binary is not universal-ctags or is not compiled with +interactive feature: bin=ctags
exit status 1
$ CTAGS_COMMAND=universal-ctags go run ./cmd/zoekt-git-index -require_ctags .
2023/10/04 17:08:29 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8657338 index bytes (overhead 2.9)
$ CTAGS_COMMAND=ctags go run ./cmd/zoekt-git-index .
2023/10/04 17:08:40 finished github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt: 8538246 index bytes (overhead 2.9)
Right now we use panicf which leads a stack trace which is misleading at
what is happening and fills up the space used by kubernetes error
reporting. Additionally a few times we have had bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error.
Then ran the following
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3
This moves all decisions about if we should do display limits and how to
enforce them into one place. By default we make it stateful so it can be
used by the limitSender. But in every other place we will always sort
and then truncate, so we also introduce a convenience API for that.
The only potential behaviour change here is if you have no display
limits but had document ranks turned on, we would not sort matches
within a shard. However, the aggregator would still do sorting anyways
so this is really just a wasted sort. In practice a display limit should
always be set.
Test Plan: go test