matchtree: call prepare on symbolRegexpMatchTree subtree (#685)
This was a huge oversight that has lived in our codebase since we
introduced symbolRegexpMatchTree. Because we didn't call prepare on its
subtree, symbol regex queries did not use the index correctly. In local
testing, fixing this makes a large difference to performance.
Huge shout-out to @camdencheek who spotted this.
Test Plan: validated with some local searches that results remain the same and
that the statistics for the searches go up for IndexBytesLoaded, but go down
for ContentBytesLoaded, FilesConsidered, FilesLoaded, etc. Added unit tests
which assert the index is used. Also perf tested with hyperfine.
Hyperfine results:
Benchmark 1: ./zoekt-before -sym '^searcher$'
Time (mean ± σ): 93.0 ms ± 1.2 ms [User: 142.2 ms, System: 18.9 ms]
Range (min … max): 90.8 ms … 95.6 ms 31 runs
Benchmark 2: ./zoekt-after -sym '^searcher$'
Time (mean ± σ): 52.3 ms ± 0.5 ms [User: 76.3 ms, System: 13.0 ms]
Range (min … max): 50.7 ms … 53.4 ms 53 runs
Summary
'./zoekt-after -sym '^searcher$'' ran
1.78 ± 0.03 times faster than './zoekt-before -sym '^searcher$''
For that search, a comparison of the zoekt stats from one arbitrary run:
| Stat | Before | After | Delta |
|---------------------- |---------- |--------- |----------- |
| ContentBytesLoaded | 199007382 | 22566033 | -176441349 |
| IndexBytesLoaded | 3527 | 165645 | 162118 |
| Crashes | 0 | 0 | 0 |
| Duration | 57956167 | 17568708 | -40387459 |
| FileCount | 28 | 28 | 0 |
| ShardFilesConsidered | 0 | 0 | 0 |
| FilesConsidered | 28477 | 766 | -27711 |
| FilesLoaded | 28477 | 766 | -27711 |
| FilesSkipped | 0 | 0 | 0 |
| ShardsScanned | 5 | 5 | 0 |
| ShardsSkipped | 0 | 0 | 0 |
| ShardsSkippedFilter | 0 | 0 | 0 |
| MatchCount | 29 | 29 | 0 |
| NgramMatches | 87 | 4407 | 4320 |
| NgramLookups | 644 | 644 | 0 |
| Wait | 5792 | 11500 | 5708 |
| MatchTreeConstruction | 498042 | 515248 | 17206 |
| MatchTreeSearch | 97661875 | 23089418 | -74572457 |
Analysis: A massive reduction in the number of files we consider.
FilesConsidered and FilesLoaded drop from 28477 to 766 (about 37x fewer),
ContentBytesLoaded drops by about 8.8x and Duration by about 3.3x. This means
we are now actually using the index properly, which is also why
IndexBytesLoaded has gone up. This was on a small corpus, so we expect a large
impact in production.
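The Delta column and the size of each reduction can be recomputed from the
Before and After columns. A small sketch, where the stat type and the delta and
reduction helpers are hypothetical names used only for illustration:

```go
package main

import "fmt"

// stat holds one row of the table above (hypothetical helper type).
type stat struct {
	name          string
	before, after int64
}

// delta matches the table's Delta column: After minus Before.
func delta(s stat) int64 { return s.after - s.before }

// reduction is Before divided by After, i.e. how many times smaller
// the value became.
func reduction(s stat) float64 { return float64(s.before) / float64(s.after) }

func main() {
	stats := []stat{
		{"ContentBytesLoaded", 199007382, 22566033},
		{"FilesConsidered", 28477, 766},
		{"MatchTreeSearch", 97661875, 23089418},
	}
	for _, s := range stats {
		fmt.Printf("%-20s delta=%d reduction=%.1fx\n", s.name, delta(s), reduction(s))
	}
}
```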
Note that the changes in Wait and MatchTreeConstruction are noise, but the
MatchTreeSearch change is a big deal since that is the time spent searching
after analysing a query.