index: experiment to limit ngram lookups for large snippets (#795)

This introduces an experiment where we stop looking up ngrams once a
configured limit is reached. The insight here is that for large
substrings we spend more time finding the ngrams with the smallest
frequency than a normal search takes. So instead we try to strike a
balance between looking for two good ngrams and actually searching the
corpus.

The plan is to set different values for
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in Sourcegraph production and
see how it affects the performance of the attribution search service.

Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.

SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...

Author: Keegan Carruthers-Smith
Committer: GitHub
Date: Jul 26, 2024, 6:01 PM +0200
Commit: 12ce07a298aed45c4ee9fa92f9acfdcb83f0836f
Parent: 5ac92b1a7d4ab7b0dbeeaa9df77abb13d555e16b

+35 -2

2 changed files

bits.go

indexdata.go

+6 -1

bits.go

···
 }
 
 func splitNGrams(str []byte) []runeNgramOff {
+	// len(maxNgrams) >= the number of ngrams in str => no limit
+	return splitNGramsLimit(str, len(str))
+}
+
+func splitNGramsLimit(str []byte, maxNgrams int) []runeNgramOff {
 	var runeGram [3]rune
 	var off [3]uint32
 	var runeCount int
···
 	result := make([]runeNgramOff, 0, len(str))
 	var i uint32
 
-	for len(str) > 0 {
+	for len(str) > 0 && len(result) < maxNgrams {
 		r, sz := utf8.DecodeRune(str)
 		str = str[sz:]
 		runeGram[0] = runeGram[1]
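The effect of the new loop condition can be tried in isolation with a simplified sketch. This is not the zoekt code: `trigramsLimit` is a hypothetical stand-in that returns plain strings instead of `runeNgramOff` values and tracks no offsets. It only illustrates that the limit counts emitted ngrams (not bytes), and that passing the byte length acts as "no limit", since a string never has more trigrams than bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trigramsLimit is a hypothetical, simplified stand-in for splitNGramsLimit.
// It decodes runes one at a time and stops once maxNgrams trigrams have been
// collected.
func trigramsLimit(s []byte, maxNgrams int) []string {
	var window [3]rune
	var runeCount int
	result := make([]string, 0, len(s))

	for len(s) > 0 && len(result) < maxNgrams {
		r, sz := utf8.DecodeRune(s)
		s = s[sz:]

		// Slide the three-rune window forward by one rune.
		window[0], window[1], window[2] = window[1], window[2], r
		runeCount++
		if runeCount < 3 {
			continue // not enough runes for a full trigram yet
		}
		result = append(result, string(window[:]))
	}
	return result
}

func main() {
	in := []byte("héllo wörld")

	// Limit of 2: only the first two trigrams are produced.
	fmt.Println(trigramsLimit(in, 2)) // [hél éll]

	// len(in) is the byte length (13), which is >= the trigram count (9),
	// so it behaves as "no limit".
	fmt.Println(len(trigramsLimit(in, len(in)))) // 9
}
```

Because the limit is applied while decoding runes, multi-byte characters are never split, which is the concern the "can't just do str = str[:ngramLimit]" note in indexdata.go raises about byte slicing.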

+29 -1

indexdata.go

···
 	"hash/crc64"
 	"log"
 	"math/bits"
+	"os"
 	"slices"
+	"strconv"
 	"unicode/utf8"
 
 	"github.com/sourcegraph/zoekt/query"
···
 	return cs
 }
 
+// experimentIterateNgramLookupLimit when non-zero will only lookup this many
+// ngrams from a query string. Note: that if case-insensitive, this only
+// limits the input. So we will still lookup the case folding.
+//
+// This experiment is targetting looking up large snippets. If it is
+// successful, we will likely hardcode the value we use in production.
+//
+// Future note: if we find cases where this works badly, we can consider only
+// searching a random subset of the query string to avoid bad strings.
+var experimentIterateNgramLookupLimit = getEnvInt("SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT")
+
+func getEnvInt(k string) int {
+	v, _ := strconv.Atoi(os.Getenv(k))
+	if v != 0 {
+		log.Printf("%s = %d\n", k, v)
+	}
+	return v
+}
+
 func (d *indexData) iterateNgrams(query *query.Substring) (*ngramIterationResults, error) {
 	str := query.Pattern
 
 	// Find the 2 least common ngrams from the string.
-	ngramOffs := splitNGrams([]byte(query.Pattern))
+	var ngramOffs []runeNgramOff
+	if ngramLimit := experimentIterateNgramLookupLimit; ngramLimit > 0 {
+		// Note: we can't just do str = str[:ngramLimit] due to utf-8 and str
+		// length is asked later on for other optimizations.
+		ngramOffs = splitNGramsLimit([]byte(str), ngramLimit)
+	} else {
+		ngramOffs = splitNGrams([]byte(str))
+	}
 
 	// protect against accidental searching of empty strings
 	if len(ngramOffs) == 0 {
