fork of https://github.com/sourcegraph/zoekt

index: use a random sample of ngrams when limiting (#797)

The first data I am getting back indicates that the current strategy of
limiting the number of ngrams we look up isn't working. I am still
experimenting with different limits, but in the meantime it is easy to
implement a strategy that picks a random subset, so that the first N
ngrams of a query aren't the only ones consulted.

Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.

SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...

+19 -1
bits.go
@@ -18,6 +18,8 @@
 	"cmp"
 	"encoding/binary"
 	"math"
+	"math/rand/v2"
+	"slices"
 	"sort"
 	"unicode"
 	"unicode/utf8"
@@ -136,7 +138,7 @@
 	result := make([]runeNgramOff, 0, len(str))
 	var i uint32
 
-	for len(str) > 0 && len(result) < maxNgrams {
+	for len(str) > 0 {
 		r, sz := utf8.DecodeRune(str)
 		str = str[sz:]
 		runeGram[0] = runeGram[1]
@@ -157,5 +159,21 @@
 			index: len(result),
 		})
 	}
+
+	// We return a random subset of size maxNgrams. This is to prevent the start
+	// of the string biasing ngram selection.
+	if maxNgrams < len(result) {
+		// Deterministic seed for tests. Additionally makes comparing repeated
+		// queries performance easier.
+		r := rand.New(rand.NewPCG(uint64(maxNgrams), 0))
+
+		// Pick random subset via a shuffle
+		r.Shuffle(maxNgrams, func(i, j int) { result[i], result[j] = result[j], result[i] })
+		result = result[:maxNgrams]
+
+		// Caller expects ngrams in order of appearance.
+		slices.SortFunc(result, runeNgramOff.Compare)
+	}
+
 	return result
 }
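The sampling step in the diff can be tried in isolation. The following is a minimal standalone sketch, not zoekt's code: `ngramOff`, `splitTrigrams` and `sampleNgrams` are hypothetical stand-ins for `runeNgramOff` and the function being patched, and treating a negative limit as "no limit" is an assumption. One detail worth checking against the diff: `Shuffle(n, swap)` only calls `swap` with indices below `n`, so the sketch passes `len(all)`. Shuffling just the first `limit` elements and then truncating to `limit` would keep exactly the leading ngrams.

```go
package main

import (
	"cmp"
	"fmt"
	"math/rand/v2"
	"slices"
)

// ngramOff is a hypothetical stand-in for zoekt's runeNgramOff: an ngram
// plus the position at which it appears in the query.
type ngramOff struct {
	ngram string
	index int
}

// splitTrigrams returns every 3-rune window of s, in order of appearance.
func splitTrigrams(s string) []ngramOff {
	runes := []rune(s)
	var out []ngramOff
	for i := 0; i+3 <= len(runes); i++ {
		out = append(out, ngramOff{ngram: string(runes[i : i+3]), index: i})
	}
	return out
}

// sampleNgrams returns at most limit elements of all, chosen
// pseudo-randomly and sorted back into order of appearance.
// It reorders all in place. A negative limit is treated as "no limit"
// (an assumption made for this sketch).
func sampleNgrams(all []ngramOff, limit int) []ngramOff {
	if limit < 0 || limit >= len(all) {
		return all
	}

	// Fixed seed derived from the limit, so repeated runs pick the same
	// subset (mirrors the deterministic seed in the diff).
	r := rand.New(rand.NewPCG(uint64(limit), 0))

	// Shuffle the whole slice, then keep the first limit elements.
	r.Shuffle(len(all), func(i, j int) { all[i], all[j] = all[j], all[i] })
	sample := all[:limit]

	// Restore order of appearance for the caller.
	slices.SortFunc(sample, func(a, b ngramOff) int {
		return cmp.Compare(a.index, b.index)
	})
	return sample
}

func main() {
	for _, ng := range sampleNgrams(splitTrigrams("random subset"), 4) {
		fmt.Printf("%d:%q\n", ng.index, ng.ngram)
	}
}
```

Because the seed depends only on the limit, two runs with the same query and limit select the same ngrams, which keeps tests stable and makes repeated queries comparable.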