fork of https://github.com/sourcegraph/zoekt
0

Configure Feed

Select the types of activity you want to include in your feed.

matchtree: call prepare on symbolRegexpMatchTree subtree (#685)

This was a huge oversight that has lived in our codebase since we
introduced symbolRegexpMatchTree. Because we don't call prepare, we
don't correctly use the index for symbol regex queries. From some local
testing this makes a huge difference to performance.

Huge shout-out to @camdencheek who spotted this.

Test Plan: validated with some local searches that results remain the same and
that the statistics for the searches go up for IndexBytesLoaded, but go down
for ContentBytesLoaded, FilesConsidered, FilesLoaded, etc. Added unit tests
which assert the index is used. Also perf tested with hyperfine.

Hyperfine results:

Benchmark 1: ./zoekt-before -sym '^searcher$'
Time (mean ± σ): 93.0 ms ± 1.2 ms [User: 142.2 ms, System: 18.9 ms]
Range (min … max): 90.8 ms … 95.6 ms 31 runs

Benchmark 2: ./zoekt-after -sym '^searcher$'
Time (mean ± σ): 52.3 ms ± 0.5 ms [User: 76.3 ms, System: 13.0 ms]
Range (min … max): 50.7 ms … 53.4 ms 53 runs

Summary
'./zoekt-after -sym '^searcher$'' ran
1.78 ± 0.03 times faster than './zoekt-before -sym '^searcher$''

For that search, a random comparison of the zoekt stats:

| Stat | Before | After | Delta |
|---------------------- |---------- |--------- |----------- |
| ContentBytesLoaded | 199007382 | 22566033 | -176441349 |
| IndexBytesLoaded | 3527 | 165645 | 162118 |
| Crashes | 0 | 0 | 0 |
| Duration | 57956167 | 17568708 | -40387459 |
| FileCount | 28 | 28 | 0 |
| ShardFilesConsidered | 0 | 0 | 0 |
| FilesConsidered | 28477 | 766 | -27711 |
| FilesLoaded | 28477 | 766 | -27711 |
| FilesSkipped | 0 | 0 | 0 |
| ShardsScanned | 5 | 5 | 0 |
| ShardsSkipped | 0 | 0 | 0 |
| ShardsSkippedFilter | 0 | 0 | 0 |
| MatchCount | 29 | 29 | 0 |
| NgramMatches | 87 | 4407 | 4320 |
| NgramLookups | 644 | 644 | 0 |
| Wait | 5792 | 11500 | 5708 |
| MatchTreeConstruction | 498042 | 515248 | 17206 |
| MatchTreeSearch | 97661875 | 23089418 | -74572457 |

Analysis: An absolutely massive reduction in the number of files we consider.
This means we are actually using the index properly. eg look at
ContentBytesLoaded, Duration, FilesConsidered, FilesLoaded. You can also see
that IndexBytesLoaded has gone up since we now use it properly. This was on a
small corpus so will have huge impact in production.

Note that the random changes Wait, MatchTreeConstruction are random, but the
MatchTreeSearch change is a big deal since that is time spent searching after
analysing a query.

+121 -12
+29 -5
cmd/zoekt/main.go
··· 130 130 }) 131 131 } 132 132 133 + // experimental support for symbol queries. We just convert substring queries 134 + // into symbol queries. Needs to run after query.ExpandFileContent 135 + func toSymbolQuery(q query.Q) query.Q { 136 + return query.Map(q, func(q query.Q) query.Q { 137 + switch s := q.(type) { 138 + case *query.Substring: 139 + if s.Content { 140 + return &query.Symbol{Expr: s} 141 + } 142 + case *query.Regexp: 143 + if s.Content { 144 + return &query.Symbol{Expr: s} 145 + } 146 + } 147 + return q 148 + }) 149 + } 150 + 133 151 func main() { 134 152 shard := flag.String("shard", "", "search in a specific shard") 135 153 index := flag.String("index_dir", ··· 141 159 verbose := flag.Bool("v", false, "print some background data") 142 160 withRepo := flag.Bool("r", false, "print the repo before the file name") 143 161 list := flag.Bool("l", false, "print matching filenames only") 162 + sym := flag.Bool("sym", false, "do experimental symbol search") 144 163 145 164 flag.Usage = func() { 146 165 name := os.Args[0] ··· 170 189 log.Fatal(err) 171 190 } 172 191 173 - query, err := query.Parse(pat) 192 + q, err := query.Parse(pat) 174 193 if err != nil { 175 194 log.Fatal(err) 176 195 } 196 + q = query.Map(q, query.ExpandFileContent) 197 + if *sym { 198 + q = toSymbolQuery(q) 199 + } 200 + q = query.Simplify(q) 177 201 if *verbose { 178 - log.Println("query:", query) 202 + log.Println("query:", q) 179 203 } 180 204 181 205 sOpts := zoekt.SearchOptions{ 182 206 DebugScore: *debug, 183 207 } 184 - sres, err := searcher.Search(context.Background(), query, &sOpts) 208 + sres, err := searcher.Search(context.Background(), q, &sOpts) 185 209 if err != nil { 186 210 log.Fatal(err) 187 211 } ··· 189 213 // If profiling, do it another time so we measure with 190 214 // warm caches. 191 215 for run := startCPUProfile(*cpuProfile, *profileTime); run(); { 192 - sres, _ = searcher.Search(context.Background(), query, &sOpts) 216 + sres, _ = searcher.Search(context.Background(), q, &sOpts) 193 217 } 194 218 for run := startFullProfile(*fullProfile, *profileTime); run(); { 195 - sres, _ = searcher.Search(context.Background(), query, &sOpts) 219 + sres, _ = searcher.Search(context.Background(), q, &sOpts) 196 220 } 197 221 198 222 displayMatches(sres.Files, pat, *withRepo, *list)
+90 -7
index_test.go
··· 369 369 }) 370 370 } 371 371 372 + // wordsAsSymbols finds all ASCII words in doc.Content which are longer than 2 373 + // chars. Those are then set as symbols. 374 + func wordsAsSymbols(doc Document) Document { 375 + re := regexp.MustCompile(`\b\w{2,}\b`) 376 + var symbols []DocumentSection 377 + for _, match := range re.FindAllIndex(doc.Content, -1) { 378 + symbols = append(symbols, DocumentSection{ 379 + Start: uint32(match[0]), 380 + End: uint32(match[1]), 381 + }) 382 + } 383 + doc.Symbols = symbols 384 + return doc 385 + } 386 + 372 387 func TestSearchStats(t *testing.T) { 373 388 ctx := context.Background() 374 389 searcher := searcherForTest(t, testIndexBuilder(t, nil, 375 - Document{Name: "f1", Content: []byte("x banana y")}, 376 - Document{Name: "f2", Content: []byte("x apple y")}, 377 - Document{Name: "f3", Content: []byte("x banana apple y")}, 378 - // -----------------------------------0123456789012345 390 + wordsAsSymbols(Document{Name: "f1", Content: []byte("x banana y")}), 391 + wordsAsSymbols(Document{Name: "f2", Content: []byte("x apple y")}), 392 + wordsAsSymbols(Document{Name: "f3", Content: []byte("x banana apple y")}), 393 + // --------------------------------------------------0123456789012345 379 394 )) 380 395 381 396 andQuery := query.NewAnd( ··· 425 440 Q: andQuery, 426 441 Want: Stats{ 427 442 FilesLoaded: 1, 428 - ContentBytesLoaded: 18, 443 + ContentBytesLoaded: 22, 429 444 IndexBytesLoaded: 8, 430 445 NgramMatches: 3, // we look at doc 1, because it's max(0,1) due to AND 431 446 NgramLookups: 104, ··· 442 457 CaseSensitive: true, 443 458 }, 444 459 Want: Stats{ 445 - ContentBytesLoaded: 12, 460 + ContentBytesLoaded: 14, 446 461 IndexBytesLoaded: 1, 447 462 FileCount: 1, 448 463 FilesConsidered: 1, ··· 459 474 Content: true, 460 475 }, 461 476 Want: Stats{ 462 - ContentBytesLoaded: 12, 477 + ContentBytesLoaded: 14, 463 478 IndexBytesLoaded: 1, 464 479 FileCount: 1, 465 480 FilesConsidered: 1, ··· 498 513 IndexBytesLoaded: 1, // we created an iterator for "a y" before pruning. 499 514 ShardsSkippedFilter: 1, 500 515 NgramLookups: 3, // we lookedup "foo" once (1), but lookedup and created "a y" (2). 516 + }, 517 + }, { 518 + Name: "symbol-substr-nomatch", 519 + Q: &query.Symbol{Expr: &query.Substring{ 520 + Pattern: "banana apple", 521 + Content: true, 522 + CaseSensitive: true, 523 + }}, 524 + Want: Stats{ 525 + IndexBytesLoaded: 3, 526 + FilesConsidered: 1, // important that we only check 1 file to ensure we are using the index 527 + MatchCount: 0, // even though there is a match it doesn't align with a symbol 528 + ShardsScanned: 1, 529 + NgramMatches: 1, 530 + NgramLookups: 12, 531 + }, 532 + }, { 533 + Name: "symbol-substr", 534 + Q: &query.Symbol{Expr: &query.Substring{ 535 + Pattern: "apple", 536 + Content: true, 537 + CaseSensitive: true, 538 + }}, 539 + Want: Stats{ 540 + ContentBytesLoaded: 35, 541 + IndexBytesLoaded: 4, 542 + FileCount: 2, 543 + FilesConsidered: 2, // must be 2 to ensure we used the index 544 + FilesLoaded: 2, 545 + MatchCount: 2, // apple symbols is in two files 546 + ShardsScanned: 1, 547 + NgramMatches: 2, 548 + NgramLookups: 5, 549 + }, 550 + }, { 551 + Name: "symbol-regexp-nomatch", 552 + Q: &query.Symbol{Expr: &query.Regexp{ 553 + Regexp: mustParseRE("^apple.banana$"), 554 + Content: true, 555 + CaseSensitive: true, 556 + }}, 557 + Want: Stats{ 558 + ContentBytesLoaded: 33, // we still have to run regex since "app" matches two documents 559 + IndexBytesLoaded: 8, 560 + FilesConsidered: 2, // important that we don't check 3 to ensure we are using the index 561 + FilesLoaded: 2, 562 + MatchCount: 0, // even though there is a match it doesn't align with a symbol 563 + ShardsScanned: 1, 564 + NgramMatches: 3, 565 + NgramLookups: 11, 566 + }, 567 + }, { 568 + Name: "symbol-regexp", 569 + Q: &query.Symbol{Expr: &query.Regexp{ 570 + Regexp: mustParseRE("^app.e$"), 571 + Content: true, 572 + CaseSensitive: true, 573 + }}, 574 + Want: Stats{ 575 + ContentBytesLoaded: 35, 576 + IndexBytesLoaded: 2, 577 + FileCount: 2, 578 + FilesConsidered: 2, // must be 2 to ensure we used the index 579 + FilesLoaded: 2, 580 + MatchCount: 2, // apple symbols is in two files 581 + ShardsScanned: 1, 582 + NgramMatches: 2, 583 + NgramLookups: 2, 501 584 }, 502 585 }} 503 586
+2
matchtree.go
··· 196 196 197 197 func (t *symbolRegexpMatchTree) prepare(doc uint32) { 198 198 t.reEvaluated = false 199 + t.found = t.found[:0] 200 + t.matchTree.prepare(doc) 199 201 } 200 202 201 203 func (t *symbolRegexpMatchTree) matches(cp *contentProvider, cost int, known map[matchTree]bool) (bool, bool) {