Commits

This was a huge oversight that has lived in our codebase since we
introduced symbolRegexpMatchTree. Because we don't call prepare, we
don't correctly use the index for symbol regex queries. From some local
testing this makes a huge difference to performance.
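
As a rough illustration of why the missing prepare call matters, here is a toy model. It is not zoekt's code: the types and functions (file, symbolIndex, buildIndex, search) are hypothetical, and the literal "searcher" is passed in by hand rather than extracted from the regex. It only shows that consulting an index first shrinks the set of files the regex has to scan while the matches stay the same.

```go
package main

import (
	"fmt"
	"regexp"
)

// file is a hypothetical document together with the symbol names it defines.
type file struct {
	name    string
	symbols []string
}

// symbolIndex is a hypothetical inverted index: symbol name -> file IDs.
type symbolIndex map[string][]int

func buildIndex(files []file) symbolIndex {
	idx := symbolIndex{}
	for id, f := range files {
		for _, s := range f.symbols {
			if ids := idx[s]; len(ids) > 0 && ids[len(ids)-1] == id {
				continue // this file is already recorded for this symbol
			}
			idx[s] = append(idx[s], id)
		}
	}
	return idx
}

// search returns the matching file names and how many files were scanned.
// With prepared=true the candidates come from the index; otherwise every
// file is a candidate, which models the behaviour before the fix.
func search(files []file, idx symbolIndex, literal, pattern string, prepared bool) (matches []string, considered int) {
	re := regexp.MustCompile(pattern)
	candidates := make([]int, 0, len(files))
	if prepared {
		candidates = append(candidates, idx[literal]...)
	} else {
		for id := range files {
			candidates = append(candidates, id)
		}
	}
	for _, id := range candidates {
		considered++
		for _, s := range files[id].symbols {
			if re.MatchString(s) {
				matches = append(matches, files[id].name)
				break
			}
		}
	}
	return matches, considered
}

func main() {
	files := []file{
		{"a.go", []string{"searcher", "newSearcher"}},
		{"b.go", []string{"indexer"}},
		{"c.go", []string{"parser"}},
	}
	idx := buildIndex(files)
	for _, prepared := range []bool{false, true} {
		m, n := search(files, idx, "searcher", `^searcher$`, prepared)
		fmt.Printf("prepared=%v matches=%v considered=%d\n", prepared, m, n)
	}
}
```

In this toy model both runs return the same match, but the prepared run scans only the files the index points at.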

Huge shout-out to @camdencheek who spotted this.

Test Plan: validated with some local searches that results remain the same and
that the statistics for the searches go up for IndexBytesLoaded, but go down
for ContentBytesLoaded, FilesConsidered, FilesLoaded, etc. Added unit tests
which assert the index is used. Also perf tested with hyperfine.

Hyperfine results:

Benchmark 1: ./zoekt-before -sym '^searcher$'
Time (mean ± σ): 93.0 ms ± 1.2 ms [User: 142.2 ms, System: 18.9 ms]
Range (min … max): 90.8 ms … 95.6 ms 31 runs

Benchmark 2: ./zoekt-after -sym '^searcher$'
Time (mean ± σ): 52.3 ms ± 0.5 ms [User: 76.3 ms, System: 13.0 ms]
Range (min … max): 50.7 ms … 53.4 ms 53 runs

Summary
'./zoekt-after -sym '^searcher$'' ran
1.78 ± 0.03 times faster than './zoekt-before -sym '^searcher$''

For that search, a comparison of the zoekt stats from one arbitrary run of each binary:

| Stat | Before | After | Delta |
|---------------------- |---------- |--------- |----------- |
| ContentBytesLoaded | 199007382 | 22566033 | -176441349 |
| IndexBytesLoaded | 3527 | 165645 | 162118 |
| Crashes | 0 | 0 | 0 |
| Duration | 57956167 | 17568708 | -40387459 |
| FileCount | 28 | 28 | 0 |
| ShardFilesConsidered | 0 | 0 | 0 |
| FilesConsidered | 28477 | 766 | -27711 |
| FilesLoaded | 28477 | 766 | -27711 |
| FilesSkipped | 0 | 0 | 0 |
| ShardsScanned | 5 | 5 | 0 |
| ShardsSkipped | 0 | 0 | 0 |
| ShardsSkippedFilter | 0 | 0 | 0 |
| MatchCount | 29 | 29 | 0 |
| NgramMatches | 87 | 4407 | 4320 |
| NgramLookups | 644 | 644 | 0 |
| Wait | 5792 | 11500 | 5708 |
| MatchTreeConstruction | 498042 | 515248 | 17206 |
| MatchTreeSearch | 97661875 | 23089418 | -74572457 |

Analysis: A massive reduction in the number of files we consider, which
means we are now actually using the index properly. For example, look at
ContentBytesLoaded, Duration, FilesConsidered and FilesLoaded. You can also see
that IndexBytesLoaded has gone up since we now use the index. This was on a
small corpus, so it should have a huge impact in production.

Note that the changes to Wait and MatchTreeConstruction are just noise, but the
MatchTreeSearch change is a big deal since that is the time spent searching
after analysing a query.
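
To put the deltas in relative terms, the following sketch recomputes a few of them as percentages. The numbers are copied from the table and the hyperfine means above; the helper name reductionPercent is made up for this example. It works out to roughly 97% fewer files considered, about 89% fewer content bytes loaded and about 76% less time in MatchTreeSearch.

```go
package main

import "fmt"

// reductionPercent returns how much smaller after is than before, in percent.
func reductionPercent(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Values copied from the stats table above.
	stats := []struct {
		name          string
		before, after float64
	}{
		{"ContentBytesLoaded", 199007382, 22566033},
		{"FilesConsidered", 28477, 766},
		{"MatchTreeSearch", 97661875, 23089418},
	}
	for _, s := range stats {
		fmt.Printf("%-20s %.1f%% lower\n", s.name, reductionPercent(s.before, s.after))
	}
	// Speedup from the hyperfine means (93.0 ms before, 52.3 ms after).
	fmt.Printf("speedup: %.2fx\n", 93.0/52.3)
}
```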

2y ago

Keegan Carruthers-Smith

bec12a77

build: faster newLinesIndices via bytes.IndexByte and buffer re-use (#680)
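
The technique named in this title can be sketched as follows. This is an illustration of the general idea, not the code from the commit: the function name newlineOffsets and its signature are assumptions. bytes.IndexByte jumps straight to the next newline instead of inspecting every byte in a Go loop, and the caller passes the previous result back in so its backing array is re-used.

```go
package main

import (
	"bytes"
	"fmt"
)

// newlineOffsets writes the byte offset of every '\n' in data into buf
// (truncated to length zero first) and returns it, so callers can re-use
// the same buffer across documents.
func newlineOffsets(data []byte, buf []uint32) []uint32 {
	buf = buf[:0]
	off := 0
	for {
		i := bytes.IndexByte(data[off:], '\n')
		if i < 0 {
			return buf
		}
		buf = append(buf, uint32(off+i))
		off += i + 1 // continue just past the newline we found
	}
}

func main() {
	var buf []uint32
	for _, doc := range []string{"a\nbc\n", "no newline", "\n\n"} {
		buf = newlineOffsets([]byte(doc), buf)
		fmt.Println(buf)
	}
}
```

Because the returned slice aliases buf, a caller that needs to keep the offsets of one document must copy them before processing the next.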

2y ago

Keegan Carruthers-Smith

47d620ab

build: use slices.Insert instead of several appends (#681)
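
For readers unfamiliar with the helper, here is a small comparison of the two styles. It is a generic sketch, not the code touched by the commit: insertWithAppends and insertWithSlices are hypothetical names, and the example assumes Go 1.21 or newer, where the slices package is part of the standard library.

```go
package main

import (
	"fmt"
	"slices"
)

// insertWithAppends inserts vs at index i using chained appends.
func insertWithAppends(s []int, i int, vs ...int) []int {
	tail := append([]int{}, s[i:]...) // copy the tail so it is not overwritten
	s = append(s[:i], vs...)
	return append(s, tail...)
}

// insertWithSlices does the same with a single standard library call.
func insertWithSlices(s []int, i int, vs ...int) []int {
	return slices.Insert(s, i, vs...)
}

func main() {
	a := insertWithAppends([]int{1, 2, 5}, 2, 3, 4)
	b := insertWithSlices([]int{1, 2, 5}, 2, 3, 4)
	fmt.Println(a, b, slices.Equal(a, b))
}
```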

2y ago

Julie Tibshirani

dc41c6e3

Add benchmark for ctags conversion (#679)

2y ago

Keegan Carruthers-Smith

0ff0dd58

ctags: monitor symbol analysis and report stuck documents (#678)

This adds a monitor which will report every minute the progress of
symbol analysis. Additionally, if a document is taking too long to
analyse (10s) we report it.

At first this is just reporting via stdlog. However, once we are
comfortable with thresholds around this we can likely also include a way
to kill analysis for a file.
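
The stuck-document check described above could look roughly like the sketch below. It is not the code from this commit: the monitor type, its methods and the demo scenario are hypothetical, and time is passed in explicitly so the example is deterministic. A real monitor would presumably call check with time.Now() from a goroutine driven by a time.Ticker.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stuckThreshold mirrors the 10s limit mentioned in the commit message.
const stuckThreshold = 10 * time.Second

// monitor is a hypothetical tracker for the document currently being analysed.
type monitor struct {
	mu      sync.Mutex
	current string
	size    int
	started time.Time
}

// begin records that analysis of a document started at now.
func (m *monitor) begin(name string, size int, now time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.current, m.size, m.started = name, size, now
}

// end records that the current document finished.
func (m *monitor) end() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.current = ""
}

// check returns a warning if the current document has been running for at
// least stuckThreshold as of now.
func (m *monitor) check(now time.Time) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.current == "" {
		return "", false
	}
	d := now.Sub(m.started)
	if d < stuckThreshold {
		return "", false
	}
	return fmt.Sprintf("WARN: symbol analysis for %s (%d bytes) has been running for %s",
		m.current, m.size, d.Round(time.Second)), true
}

// demo replays a scripted timeline and collects the warnings it produces.
func demo() []string {
	m := &monitor{}
	t0 := time.Unix(0, 0)
	m.begin("README.md", 3485, t0)
	var lines []string
	for _, elapsed := range []time.Duration{5 * time.Second, 14 * time.Second} {
		if msg, ok := m.check(t0.Add(elapsed)); ok {
			lines = append(lines, msg)
		}
	}
	m.end()
	if msg, ok := m.check(t0.Add(20 * time.Second)); ok {
		lines = append(lines, msg) // not reached: nothing is running any more
	}
	return lines
}

func main() {
	for _, l := range demo() {
		fmt.Println(l)
	}
}
```

Keeping the check a pure function of the supplied time makes the threshold easy to test without sleeping.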

Test Plan: Adjusted monitorReportStatus to 1s, then indexed the
sourcegraph repo and inspected the output:

$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:03:10 attempting to index 14533 total files
2023/11/03 16:03:13 DEBUG: symbol analysis still running for shard statistics: duration=1s symbols=15805 bytes=44288971
2023/11/03 16:03:14 DEBUG: symbol analysis still running for shard statistics: duration=2s symbols=26189 bytes=51564417
2023/11/03 16:03:15 DEBUG: symbol analysis still running for shard statistics: duration=3s symbols=55613 bytes=64748084
2023/11/03 16:03:16 DEBUG: symbol analysis still running for shard statistics: duration=4s symbols=86557 bytes=93771404
2023/11/03 16:03:17 DEBUG: symbol analysis still running for shard statistics: duration=5s symbols=125352 bytes=116319453
2023/11/03 16:03:18 symbol analysis finished for shard statistics: duration=5s symbols=142951 bytes=129180023
2023/11/03 16:03:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983298 index bytes (overhead 2.8), 14533 files processed

I then added an artificial one-minute sleep for one file to see the stuck
reporting:

$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:14:57 attempting to index 14533 total files
2023/11/03 16:15:15 WARN: symbol analysis for README.md (3485 bytes) has been running for 14s
2023/11/03 16:15:25 WARN: symbol analysis for README.md (3485 bytes) has been running for 24s
2023/11/03 16:15:45 WARN: symbol analysis for README.md (3485 bytes) has been running for 44s
2023/11/03 16:16:00 DEBUG: symbol analysis still running for shard statistics: duration=1m0s symbols=958 bytes=624329
2023/11/03 16:16:00 symbol analysis for README.md (size 3485 bytes) is done and found 4 symbols
2023/11/03 16:16:06 symbol analysis finished for shard statistics: duration=1m5s symbols=142951 bytes=129180023
2023/11/03 16:16:10 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983299 index bytes (overhead 2.8), 14533 files processed