Commits
This PR adds an experimental option `UseKeywordScoring` for scoring file matches using [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/), the most common approach in keyword search. It treats each match in a file as a term and uses term frequencies to compute an approximation to BM25.
For now, it makes several simplifications:
* We drop inverse document frequency (idf) from the formula, since we don't have access to global term statistics
* There is no special handling for case sensitivity, filename matches, or symbols
* It ignores all other scoring signals, since BM25 is not normalized and it can be hard to combine it with other signals
The `indexData.Search` method is super long. This PR pulls most scoring logic
into its own function, which will make it easier to modify in future PRs.
Testing:
* All Zoekt tests still pass
* Ran example queries with/ without the change, and the search-debug output was
the same
This is to resolve CVE-2023-25652 and CVE-2023-29007.
Additionally we update the gitindex tests to specify the default branch
to use. This was to fix local failures I was experiencing.
Test Plan: go test inside of nix develop. Additionally ensure the
version of git was larger than 2.38.5 when testing.
Co-authored-by: davejrt <davetry@gmail.com>
This copies how we do it in the Sourcegraph repo, using the same lock
file to ensure we are using the same versions.
Test Plan: nix develop
Before this change case sensitive symbol searches of the form
\bLITERAL\b would result in the error message
found *zoekt.andMatchTree inside query.Symbol
The root cause is the symbol search code is pretty janky. It constructs
a matchtree and then pulls out either the regex matchtree or the substr
matchtree, then creates a specific symbol searcher for those. It feels
like it could be more generic rather than a bunch of hard to follow copy
pasta. For now this was a more straightforward fix than creating a word
search for symbols.
Test Plan: expanded the unit tests to cover word search more often.
Additionally run zoekt-webserver and checked that a search like the
following works now:
sym:\bnewMatchTree\b case:yes
This makes it so you can run submodule tests without first adjusting
your global environment to allow file clones. This was a somewhat recent
change in git to require opting in. Previously we would just set the
config in CI, but now we just always do it _just_ for that test run.
Test Plan: go test and CI
The search-core team has been renamed to search-team, resulting in the failure of the sync job.
tiny refactor to make the code more succint. Based on a comment in #562.
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
fixes a bug where we wouldn't display branches correctly if the query contained one or more branch filters.
Add support for parsing queries for zoekt.fork and zoekt.public.
I removed the code for binary search ngrams in #540, so this check is
not needed anymore.
Even when a repo has ranking data, certain files will not have ranks, like
Markdown or yaml files. Currently these have rank 0, which puts them at a big
disadvantage and means they're usually ranked last.
This PR proposes to use the mean repo rank instead of 0. The rules:
* If we have a concrete rank for the file, always use it
* If there's no rank, and it's a low priority file like a test, then use rank 0
* Otherwise use the mean rank for the repository
We don't attempt to handle the case where an entire repo is missing ranks
because it doesn't have precise code intel.
Co-authored-by: Camden Cheek <camden@ccheek.com>
This just links to the HTML source where there are already example
queries. This will help with discovery and linking people to docs where
they aren't running a local Zoekt instance.
We are updating the ranking service API to only return a single rank
representing the log of each file's reference counts. This PR adapts
zoekt-indexserver to the new API shape, and updates the normalization strategy.
For now, it doesn't change the on-disk format to keep the change simple.
The file score includes a "repo rank" signal, which is based on the
repository's number of stars. Previously, we were aggressively normalizing the
number of stars, which made the repo ranks small and close together. This PR
changes the normalization to spread it out better over the full range. This
increases its contribution to the score.
This adds an integration test for sharded search when document ranking is
enabled. It checks that results are combined correctly across multiple shards,
and that the ranked combined results are streamed out.
Before, the collector aggregated all file matches before flushing, when it
finally sorted and truncated them. Now we sort and limit while collecting
results, instead of at the end. This can help cut down on memory in the case
there are many shard results. (It won't help with `count: all` queries, where
MaxDocDisplayCount is not set.)
Another small benefit is that we obey FlushWallTime more closely. Before, we
might sort a very large number of results after already using up the whole time.
This changes how atom-score is calculated to better reflect the
complexity of the result or the user's intent.
Currently, the atom-score is the ratio "atomMatchCount/totalAtomAcount".
However, "totalAtomAcount" is based on the pruned match-tree which may
be very different from the user query. For example, a match-tree for the
query "foo or bar" is pruned to (a matchtree representing the query)
"foo" if the shard doesn't contain the trigram "bar". In an extreme
case, a query like "foo or bar or bas or qux" can receive the maximum
atom-score if a repo just contains matches for foo.
Assuming that the score of a match should reflect how close a match is
to the user's intent, we should rather base the atom-score on the
original unpruned match-tree. However, the original match-tree is
already based on a simplified query, (see "d.simplify(q)" in eval.go)
Hence I change the scoring function for atom-score to be based on the
absolute count and to assymptotically approach scoreFactorAtomMatch.
This also makes the score more comparable across shards.
This previously was only being used for storing thre repos. Since bare
repos take up much more disk space that the indexes and have different
IO requirements they need to be on a separate disk. We may later
introduce a more generic data_dir which can then set default values for
repo_dir and index_dir but for now it's simpler to make these
2 options explicit.
The use of a b-tree has proven to be successful. Webserver's heap for a production instance of Sourcegraph shrank by 45% without a noticable impact on latency. Hence we set the b-tree as default. It can be disabled by setting ZOEKT_DISABLE_BTREE=true for webserver.
At the same time we remove binarySearchNgram, because it uses more disk accesses than the b-tree and keeps a reference to the posting offsets.
Once the b-tree has proven itself over a longer period of time, I will remove the alternative code path (combinedNgramOffset) and clean up the temporary structs and interfaces I created.
Ensures the new "flush collect sender" is exercised in tests.
* Replace unix/syscall `mmap` with `mmap-go`
`mmap-go` is a cross-platform memory-mapping package.
It uses `Mmap` on *Nix systems
and `CreateFileMapping` + `MapViewOfFile` on Windows.
`mmap-go` benchmarks the same as `unix.Mmap`
(see [https://github.com/peterguy/benchmark-mmap](https://github.com/peterguy/benchmark-mmap)).
BONUS! `indexfile_other.go` used straight file reads (`os.File::ReadAt`)
to support the same kind of behavior as mmap on non-*Nix platforms;
`mmap-go` uses Windows APIs to do real memory mapping on Windows,
which is faster by several magnitudes.
Also change the rounding/padding of the memory buffer being allocated.
It was hard-coded to round up to the nearest 4096 bytes
in order to page align for mmap,
but not all operating systems use 4k pages.
Instead of the hard coded value, use `os.Getpagesize()`.
* Build-constrain usages of `unix.Umask`
* build-constrain zoekt webserver for cross-platform
* Fix buffer size in Windows
and refacter to a method that calculates
the correct buffer size for the platform
* use gopsutil v3 for disk space metrics
instead of unix/syscall Statfs
* Build-constrain the Prometheus memory map metrics
Windows does not have the concept of a count of mapped memory areas.
Non-linux operating systems do not use the /sys pseudo-filesystem.
Constrain `mustRegisterMemoryMapMetrics` to linux-only builds.
Other platforms use a no-op function.
* cleanup
- go get -u -t ./..
- remove unused import
* Update cmd/zoekt-webserver/main.go
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
---------
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
A common search Zoekt gets from Sourcegraph is "\bLITERAL\b". With this PR we avoid the regex engine for these type of queries and provide something faster.
Local benchmarks show that the new code runs 4.8x faster for select queries.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
With this change we use the b-tree for filnames just like we do for content. Based on heap profiles from production instances, we anticipate a reduction of the heap by 15-20%. The change is behind the feature flag "ZOEKT_ENABLE_BTREE_NAME".
With this change we expose "the maximum allowed priority" and "the minimum age
of the latest commit" as command line flags and ENVs.
At the same time we clean up the Server struct a bit and bundle all
fields related to shard merging in a dedicated options struct.
The motiviation is to make shard merging fully configurable before
rolling it out to all customers.
The btree still referenced the offsets to the posting lists which had a
big impact on the heap usage of webserver.
With this change we just hold on to the simple section pointing to the
offsets IE we read offsets lazily and therfore incur another disk
access.
I added a unit test for ngramIndex.Get, because there wasn't really a
good test so far that captured the expected size of a posting list.
In total, after this change, the retrieval of a posting list for a given
ngram via the btree requires 2 disk accesses (1 for getting the bucket
ngrams, 1 for getting the offset to the posting list).
If this turns out to be to costly with regards to latency, we should
consider storing the offset together with the ngrams in the bucket.
This is intended to be used for making it possible for Zoekt to power code
search in GitLab.
Co-authored-by: Dmitry Gruzd <dgruzd@gitlab.com>
Instead of creating bucket offsets on read, we calculate the bucket
offset for a given bucket index on the fly.
This is a new parameter that can be passed in the request body to filter
searches to a list of repositories by `repoid`. This makes use of a new
`RepoIds` query filter.
Repeat the fix from #506.
This PR proposes to simplify how match scores are combined with static ranking
signals like the file rank. Instead of using reciprocal rank fusion (RRF), we
stick with the same strategy we already use to combine ranking signals, a
weighted sum. In my experience, this is much easier to debug and tune compared
to RRF. We're in full control of the ranking signals, and can make sure they're
bounded + meaningful, so using a sum seems totally fine.
It also removes the index sorting based on file rank. From my testing, it
didn't really make a difference to improving result quality. Removing it also
opens up the possibility of storing the file ranks outside the immutable index
data. In the future, this would let us update the ranks more often, without
worrying about fully reindexing repositories.
Our in-memory map of ngrams to posting lists contributes about 33% to webserver's heap usage in a production setting. This replaces the current implementation with a B+-tree. We expect to reduce the memory footprint of the mapping by around 95% at the cost of slower search speeds. The change is behind a feature-flag. To activate, set ZOEKT_ENABLE_BTREE=true in ENV.
Previously this endpoint was always returning an empty response for
valid requests. This is because the last line failed to encode the
SearchResult as JSON. After some investigation I can see that this is
because we default the Progress fields to `-Inf`. Since these progress
fields only seemm to be applicable for streaming use cases that use RPC
and don't encode as JSON then it seemed like the best fix was to simply
omit them in the JSON response.
This change also adds error handling to the JSON encoding so that these
errors will be simpler to debug in future. A test is also included that
reproduces the issue.
You can reproduce this issue and the fix locally by just curling the
search endpoint like:
```bash
curl -H 'Accept: application/json' -XPOST -d '{"Q":"a.*"}' 'http://127.0.0.1:6070/api/search' -i
```
We already have two implementations, combinedNgramOffset and
binarySearchNgram, for the inverted index from ngrams to posting lists.
With the btree I am working on, a third implementation will follow soon.
The current apporach guarantees a zero-value that won't panic in the
tests, however, I don't think we should pile on top of that for the
third implementation, but rather handle nil in the code and return an
error where it makes sense.
with the `golang.org/x/sys/unix` package.
`syscall` has been frozen since Go 1.3 and deprecated
(https://go.dev/doc/go1.4#major_library_changes).
Using the `golang.org/x/sys/unix` package will bring in bug fixes
and enhancements since `syscall` was frozen in 1.3,
and will pave the way for multi-platform builds
(which will affect only the single-program local install, most likely).
Updated matching rules for tsx.
This is repentance for introducing this bug and others fixing it. This
adds a test to ensure we start up correctly if there are no shards to
load.
Test Plan: Revert the fix 54c8cf0c and check that test fails.
Our RPC system does not support shutdown and hijacks the underlying http
connection. This means shutdown is ineffective and just waits 10s before
calling close. Lets just quit faster in that case.
Test Plan: Run zoekt-webserver with and without rpc. Run kill and
observe behaviour.
The links were absolute and therefore broken. If you just run Zoekt by
itself that's fine, but in conjunction with Sourcegraph the links don't
work.
We have been running with eager runDocSection decoding on sourcegraph.com this
week and have seen improvements to tail latency and average latency. We
believe the lazy decoding was unnecessary since we so often do global symbol
searches that we would constantly be decoding doc section data all the time.
This PR adds an experimental option `UseKeywordScoring` for scoring file matches using [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/), the most common approach in keyword search. It treats each match in a file as a term and uses term frequencies to compute an approximation to BM25.
For now, it makes several simplifications:
* We drop inverse document frequency (idf) from the formula, since we don't have access to global term statistics
* There is no special handling for case sensitivity, filename matches, or symbols
* It ignores all other scoring signals, since BM25 is not normalized and it can be hard to combine it with other signals
This is to resolve CVE-2023-25652 and CVE-2023-29007.
Additionally we update the gitindex tests to specify the default branch
to use. This was to fix local failures I was experiencing.
Test Plan: go test inside of nix develop. Additionally ensure the
version of git was larger than 2.38.5 when testing.
Before this change case sensitive symbol searches of the form
\bLITERAL\b would result in the error message
found *zoekt.andMatchTree inside query.Symbol
The root cause is the symbol search code is pretty janky. It constructs
a matchtree and then pulls out either the regex matchtree or the substr
matchtree, then creates a specific symbol searcher for those. It feels
like it could be more generic rather than a bunch of hard to follow copy
pasta. For now this was a more straightforward fix than creating a word
search for symbols.
Test Plan: expanded the unit tests to cover word search more often.
Additionally run zoekt-webserver and checked that a search like the
following works now:
sym:\bnewMatchTree\b case:yes
This makes it so you can run submodule tests without first adjusting
your global environment to allow file clones. This was a somewhat recent
change in git to require opting in. Previously we would just set the
config in CI, but now we just always do it _just_ for that test run.
Test Plan: go test and CI
Even when a repo has ranking data, certain files will not have ranks, like
Markdown or yaml files. Currently these have rank 0, which puts them at a big
disadvantage and means they're usually ranked last.
This PR proposes to use the mean repo rank instead of 0. The rules:
* If we have a concrete rank for the file, always use it
* If there's no rank, and it's a low priority file like a test, then use rank 0
* Otherwise use the mean rank for the repository
We don't attempt to handle the case where an entire repo is missing ranks
because it doesn't have precise code intel.
The file score includes a "repo rank" signal, which is based on the
repository's number of stars. Previously, we were aggressively normalizing the
number of stars, which made the repo ranks small and close together. This PR
changes the normalization to spread it out better over the full range. This
increases its contribution to the score.
Before, the collector aggregated all file matches before flushing, when it
finally sorted and truncated them. Now we sort and limit while collecting
results, instead of at the end. This can help cut down on memory in the case
there are many shard results. (It won't help with `count: all` queries, where
MaxDocDisplayCount is not set.)
Another small benefit is that we obey FlushWallTime more closely. Before, we
might sort a very large number of results after already using up the whole time.
This changes how atom-score is calculated to better reflect the
complexity of the result or the user's intent.
Currently, the atom-score is the ratio "atomMatchCount/totalAtomAcount".
However, "totalAtomAcount" is based on the pruned match-tree which may
be very different from the user query. For example, a match-tree for the
query "foo or bar" is pruned to (a matchtree representing the query)
"foo" if the shard doesn't contain the trigram "bar". In an extreme
case, a query like "foo or bar or bas or qux" can receive the maximum
atom-score if a repo just contains matches for foo.
Assuming that the score of a match should reflect how close a match is
to the user's intent, we should rather base the atom-score on the
original unpruned match-tree. However, the original match-tree is
already based on a simplified query, (see "d.simplify(q)" in eval.go)
Hence I change the scoring function for atom-score to be based on the
absolute count and to assymptotically approach scoreFactorAtomMatch.
This also makes the score more comparable across shards.
This previously was only being used for storing thre repos. Since bare
repos take up much more disk space that the indexes and have different
IO requirements they need to be on a separate disk. We may later
introduce a more generic data_dir which can then set default values for
repo_dir and index_dir but for now it's simpler to make these
2 options explicit.
The use of a b-tree has proven to be successful. Webserver's heap for a production instance of Sourcegraph shrank by 45% without a noticable impact on latency. Hence we set the b-tree as default. It can be disabled by setting ZOEKT_DISABLE_BTREE=true for webserver.
At the same time we remove binarySearchNgram, because it uses more disk accesses than the b-tree and keeps a reference to the posting offsets.
Once the b-tree has proven itself over a longer period of time, I will remove the alternative code path (combinedNgramOffset) and clean up the temporary structs and interfaces I created.
* Replace unix/syscall `mmap` with `mmap-go`
`mmap-go` is a cross-platform memory-mapping package.
It uses `Mmap` on *Nix systems
and `CreateFileMapping` + `MapViewOfFile` on Windows.
`mmap-go` benchmarks the same as `unix.Mmap`
(see [https://github.com/peterguy/benchmark-mmap](https://github.com/peterguy/benchmark-mmap)).
BONUS! `indexfile_other.go` used straight file reads (`os.File::ReadAt`)
to support the same kind of behavior as mmap on non-*Nix platforms;
`mmap-go` uses Windows APIs to do real memory mapping on Windows,
which is faster by several magnitudes.
Also change the rounding/padding of the memory buffer being allocated.
It was hard-coded to round up to the nearest 4096 bytes
in order to page align for mmap,
but not all operating systems use 4k pages.
Instead of the hard coded value, use `os.Getpagesize()`.
* Build-constrain usages of `unix.Umask`
* build-constrain zoekt webserver for cross-platform
* Fix buffer size in Windows
and refacter to a method that calculates
the correct buffer size for the platform
* use gopsutil v3 for disk space metrics
instead of unix/syscall Statfs
* Build-constrain the Prometheus memory map metrics
Windows does not have the concept of a count of mapped memory areas.
Non-linux operating systems do not use the /sys pseudo-filesystem.
Constrain `mustRegisterMemoryMapMetrics` to linux-only builds.
Other platforms use a no-op function.
* cleanup
- go get -u -t ./..
- remove unused import
* Update cmd/zoekt-webserver/main.go
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
---------
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
With this change we expose "the maximum allowed priority" and "the minimum age
of the latest commit" as command line flags and ENVs.
At the same time we clean up the Server struct a bit and bundle all
fields related to shard merging in a dedicated options struct.
The motiviation is to make shard merging fully configurable before
rolling it out to all customers.
The btree still referenced the offsets to the posting lists which had a
big impact on the heap usage of webserver.
With this change we just hold on to the simple section pointing to the
offsets IE we read offsets lazily and therfore incur another disk
access.
I added a unit test for ngramIndex.Get, because there wasn't really a
good test so far that captured the expected size of a posting list.
In total, after this change, the retrieval of a posting list for a given
ngram via the btree requires 2 disk accesses (1 for getting the bucket
ngrams, 1 for getting the offset to the posting list).
If this turns out to be to costly with regards to latency, we should
consider storing the offset together with the ngrams in the bucket.
This PR proposes to simplify how match scores are combined with static ranking
signals like the file rank. Instead of using reciprocal rank fusion (RRF), we
stick with the same strategy we already use to combine ranking signals, a
weighted sum. In my experience, this is much easier to debug and tune compared
to RRF. We're in full control of the ranking signals, and can make sure they're
bounded + meaningful, so using a sum seems totally fine.
It also removes the index sorting based on file rank. From my testing, it
didn't really make a difference to improving result quality. Removing it also
opens up the possibility of storing the file ranks outside the immutable index
data. In the future, this would let us update the ranks more often, without
worrying about fully reindexing repositories.
Our in-memory map of ngrams to posting lists contributes about 33% to webserver's heap usage in a production setting. This replaces the current implementation with a B+-tree. We expect to reduce the memory footprint of the mapping by around 95% at the cost of slower search speeds. The change is behind a feature-flag. To activate, set ZOEKT_ENABLE_BTREE=true in ENV.
Previously this endpoint was always returning an empty response for
valid requests. This is because the last line failed to encode the
SearchResult as JSON. After some investigation I can see that this is
because we default the Progress fields to `-Inf`. Since these progress
fields only seemm to be applicable for streaming use cases that use RPC
and don't encode as JSON then it seemed like the best fix was to simply
omit them in the JSON response.
This change also adds error handling to the JSON encoding so that these
errors will be simpler to debug in future. A test is also included that
reproduces the issue.
You can reproduce this issue and the fix locally by just curling the
search endpoint like:
```bash
curl -H 'Accept: application/json' -XPOST -d '{"Q":"a.*"}' 'http://127.0.0.1:6070/api/search' -i
```
We already have two implementations, combinedNgramOffset and
binarySearchNgram, for the inverted index from ngrams to posting lists.
With the btree I am working on, a third implementation will follow soon.
The current apporach guarantees a zero-value that won't panic in the
tests, however, I don't think we should pile on top of that for the
third implementation, but rather handle nil in the code and return an
error where it makes sense.
with the `golang.org/x/sys/unix` package.
`syscall` has been frozen since Go 1.3 and deprecated
(https://go.dev/doc/go1.4#major_library_changes).
Using the `golang.org/x/sys/unix` package will bring in bug fixes
and enhancements since `syscall` was frozen in 1.3,
and will pave the way for multi-platform builds
(which will affect only the single-program local install, most likely).
We have been running with eager runDocSection decoding on sourcegraph.com this
week and have seen improvements to tail latency and average latency. We
believe the lazy decoding was unnecessary since we so often do global symbol
searches that we would constantly be decoding doc section data all the time.