Indexing: improve skipped doc handling (#687) · boltless.me/zoekt@e068116

fork of https://github.com/sourcegraph/zoekt


This change makes a couple of small improvements to how we handle skipped docs:
* Immediately skip ctags parsing if the content is empty (`nil` or zero-length)
* Always sort skipped docs to the end of the shard. This is a nice invariant
to have, and it is generally good for performance to group data that is
expected to be accessed together and has similar content.

author: Julie Tibshirani
committer: GitHub
date: Nov 13, 2023, 11:45 AM -0800
commit: e068116194eadd9ef4d619fc53371289bf317d58
parent: 2355607d5ff884be315631cfb1f2ad27bacf10fc

+30 -1

3 changed files


build/builder.go
build/ctags.go
build/e2e_test.go

build/builder.go

@@ -951,6 +951,11 @@
 // at query time, because earlier documents receive a boost at query time and
 // have a higher chance of being searched before limits kick in.
 func rank(d *zoekt.Document, origIdx int) []float64 {
+	skipped := 0.0
+	if d.SkipReason != "" {
+		skipped = 1.0
+	}
+
 	generated := 0.0
 	if isGenerated(d.Name) {
 		generated = 1.0
@@ -968,6 +973,9 @@
 
 	// Smaller is earlier (=better).
 	return []float64{
+		// Always place skipped docs last
+		skipped,
+
 		// Prefer docs that are not generated
 		generated,
 

+1 -1

build/ctags.go

@@ -49,7 +49,7 @@
 	var tagsToSections tagsToSections
 
 	for _, doc := range todo {
-		if doc.Symbols != nil {
+		if len(doc.Content) == 0 || doc.Symbols != nil {
 			continue
 		}
 
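The new guard uses `len(doc.Content) == 0` rather than a `nil` comparison, so it covers both a `nil` slice and an empty non-nil one. A small sketch of the condition in isolation, assuming a hypothetical `doc` struct and `needsCtags` helper (neither is part of zoekt, and the `Symbols` element type here is a placeholder):

```go
package main

import "fmt"

// doc is a hypothetical, simplified stand-in for zoekt.Document.
type doc struct {
	Name    string
	Content []byte
	Symbols []string // placeholder element type (assumption)
}

// needsCtags is the negation of the guard in the diff above: a document is
// parsed only if it has content and its symbols have not been set yet.
func needsCtags(d *doc) bool {
	return len(d.Content) != 0 && d.Symbols == nil
}

func main() {
	docs := []*doc{
		{Name: "skipped", Content: nil},                                     // nil content
		{Name: "empty", Content: []byte{}},                                  // empty, non-nil content
		{Name: "fresh.go", Content: []byte("package a")},                    // needs parsing
		{Name: "done.go", Content: []byte("x"), Symbols: []string{"main"}}, // already has symbols
	}
	for _, d := range docs {
		fmt.Printf("%s: %v\n", d.Name, needsCtags(d))
	}
}
```

In Go, `len` of a `nil` slice is 0, which is why a single length check handles skipped documents whose content was never populated.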

+21

build/e2e_test.go

@@ -525,6 +525,27 @@
 			},
 		},
 		want: []int{0, 2, 1},
+	}, {
+		name: "skipped docs",
+		docs: []*zoekt.Document{
+			{
+				Name:       "binary_file",
+				SkipReason: "binary file",
+			},
+			{
+				Name:    "some_test.go",
+				Content: []byte("bla"),
+			},
+			{
+				Name:       "large_file.go",
+				SkipReason: "too large",
+			},
+			{
+				Name:    "file.go",
+				Content: []byte("blabla"),
+			},
+		},
+		want: []int{3, 1, 0, 2},
 	}} {
 		t.Run(c.name, func(t *testing.T) {
 			testFileRankAspect(t, c)
