Commits
Even when a repo has ranking data, certain files will not have ranks, like
Markdown or yaml files. Currently these have rank 0, which puts them at a big
disadvantage and means they're usually ranked last.
This PR proposes to use the mean repo rank instead of 0. The rules:
* If we have a concrete rank for the file, always use it
* If there's no rank, and it's a low priority file like a test, then use rank 0
* Otherwise use the mean rank for the repository
We don't attempt to handle the case where an entire repo is missing ranks
because it doesn't have precise code intel.
Co-authored-by: Camden Cheek <camden@ccheek.com>
This just links to the HTML source where there are already example
queries. This will help with discovery and linking people to docs where
they aren't running a local Zoekt instance.
We are updating the ranking service API to only return a single rank
representing the log of each file's reference counts. This PR adapts
zoekt-indexserver to the new API shape, and updates the normalization strategy.
For now, it doesn't change the on-disk format to keep the change simple.
The file score includes a "repo rank" signal, which is based on the
repository's number of stars. Previously, we were aggressively normalizing the
number of stars, which made the repo ranks small and close together. This PR
changes the normalization to spread it out better over the full range. This
increases its contribution to the score.
This adds an integration test for sharded search when document ranking is
enabled. It checks that results are combined correctly across multiple shards,
and that the ranked combined results are streamed out.
Before, the collector aggregated all file matches before flushing, when it
finally sorted and truncated them. Now we sort and limit while collecting
results, instead of at the end. This can help cut down on memory in the case
there are many shard results. (It won't help with `count: all` queries, where
MaxDocDisplayCount is not set.)
Another small benefit is that we obey FlushWallTime more closely. Before, we
might sort a very large number of results after already using up the whole time.
This changes how atom-score is calculated to better reflect the
complexity of the result or the user's intent.
Currently, the atom-score is the ratio "atomMatchCount/totalAtomAcount".
However, "totalAtomAcount" is based on the pruned match-tree which may
be very different from the user query. For example, a match-tree for the
query "foo or bar" is pruned to (a matchtree representing the query)
"foo" if the shard doesn't contain the trigram "bar". In an extreme
case, a query like "foo or bar or bas or qux" can receive the maximum
atom-score if a repo just contains matches for foo.
Assuming that the score of a match should reflect how close a match is
to the user's intent, we should rather base the atom-score on the
original unpruned match-tree. However, the original match-tree is
already based on a simplified query, (see "d.simplify(q)" in eval.go)
Hence I change the scoring function for atom-score to be based on the
absolute count and to assymptotically approach scoreFactorAtomMatch.
This also makes the score more comparable across shards.
This previously was only being used for storing thre repos. Since bare
repos take up much more disk space that the indexes and have different
IO requirements they need to be on a separate disk. We may later
introduce a more generic data_dir which can then set default values for
repo_dir and index_dir but for now it's simpler to make these
2 options explicit.
The use of a b-tree has proven to be successful. Webserver's heap for a production instance of Sourcegraph shrank by 45% without a noticable impact on latency. Hence we set the b-tree as default. It can be disabled by setting ZOEKT_DISABLE_BTREE=true for webserver.
At the same time we remove binarySearchNgram, because it uses more disk accesses than the b-tree and keeps a reference to the posting offsets.
Once the b-tree has proven itself over a longer period of time, I will remove the alternative code path (combinedNgramOffset) and clean up the temporary structs and interfaces I created.
Ensures the new "flush collect sender" is exercised in tests.
* Replace unix/syscall `mmap` with `mmap-go`
`mmap-go` is a cross-platform memory-mapping package.
It uses `Mmap` on *Nix systems
and `CreateFileMapping` + `MapViewOfFile` on Windows.
`mmap-go` benchmarks the same as `unix.Mmap`
(see [https://github.com/peterguy/benchmark-mmap](https://github.com/peterguy/benchmark-mmap)).
BONUS! `indexfile_other.go` used straight file reads (`os.File::ReadAt`)
to support the same kind of behavior as mmap on non-*Nix platforms;
`mmap-go` uses Windows APIs to do real memory mapping on Windows,
which is faster by several magnitudes.
Also change the rounding/padding of the memory buffer being allocated.
It was hard-coded to round up to the nearest 4096 bytes
in order to page align for mmap,
but not all operating systems use 4k pages.
Instead of the hard coded value, use `os.Getpagesize()`.
* Build-constrain usages of `unix.Umask`
* build-constrain zoekt webserver for cross-platform
* Fix buffer size in Windows
and refacter to a method that calculates
the correct buffer size for the platform
* use gopsutil v3 for disk space metrics
instead of unix/syscall Statfs
* Build-constrain the Prometheus memory map metrics
Windows does not have the concept of a count of mapped memory areas.
Non-linux operating systems do not use the /sys pseudo-filesystem.
Constrain `mustRegisterMemoryMapMetrics` to linux-only builds.
Other platforms use a no-op function.
* cleanup
- go get -u -t ./..
- remove unused import
* Update cmd/zoekt-webserver/main.go
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
---------
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
A common search Zoekt gets from Sourcegraph is "\bLITERAL\b". With this PR we avoid the regex engine for these type of queries and provide something faster.
Local benchmarks show that the new code runs 4.8x faster for select queries.
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
With this change we use the b-tree for filnames just like we do for content. Based on heap profiles from production instances, we anticipate a reduction of the heap by 15-20%. The change is behind the feature flag "ZOEKT_ENABLE_BTREE_NAME".
With this change we expose "the maximum allowed priority" and "the minimum age
of the latest commit" as command line flags and ENVs.
At the same time we clean up the Server struct a bit and bundle all
fields related to shard merging in a dedicated options struct.
The motiviation is to make shard merging fully configurable before
rolling it out to all customers.
The btree still referenced the offsets to the posting lists which had a
big impact on the heap usage of webserver.
With this change we just hold on to the simple section pointing to the
offsets IE we read offsets lazily and therfore incur another disk
access.
I added a unit test for ngramIndex.Get, because there wasn't really a
good test so far that captured the expected size of a posting list.
In total, after this change, the retrieval of a posting list for a given
ngram via the btree requires 2 disk accesses (1 for getting the bucket
ngrams, 1 for getting the offset to the posting list).
If this turns out to be to costly with regards to latency, we should
consider storing the offset together with the ngrams in the bucket.
This is intended to be used for making it possible for Zoekt to power code
search in GitLab.
Co-authored-by: Dmitry Gruzd <dgruzd@gitlab.com>
Instead of creating bucket offsets on read, we calculate the bucket
offset for a given bucket index on the fly.
This is a new parameter that can be passed in the request body to filter
searches to a list of repositories by `repoid`. This makes use of a new
`RepoIds` query filter.
Repeat the fix from #506.
This PR proposes to simplify how match scores are combined with static ranking
signals like the file rank. Instead of using reciprocal rank fusion (RRF), we
stick with the same strategy we already use to combine ranking signals, a
weighted sum. In my experience, this is much easier to debug and tune compared
to RRF. We're in full control of the ranking signals, and can make sure they're
bounded + meaningful, so using a sum seems totally fine.
It also removes the index sorting based on file rank. From my testing, it
didn't really make a difference to improving result quality. Removing it also
opens up the possibility of storing the file ranks outside the immutable index
data. In the future, this would let us update the ranks more often, without
worrying about fully reindexing repositories.
Our in-memory map of ngrams to posting lists contributes about 33% to webserver's heap usage in a production setting. This replaces the current implementation with a B+-tree. We expect to reduce the memory footprint of the mapping by around 95% at the cost of slower search speeds. The change is behind a feature-flag. To activate, set ZOEKT_ENABLE_BTREE=true in ENV.
Previously this endpoint was always returning an empty response for
valid requests. This is because the last line failed to encode the
SearchResult as JSON. After some investigation I can see that this is
because we default the Progress fields to `-Inf`. Since these progress
fields only seemm to be applicable for streaming use cases that use RPC
and don't encode as JSON then it seemed like the best fix was to simply
omit them in the JSON response.
This change also adds error handling to the JSON encoding so that these
errors will be simpler to debug in future. A test is also included that
reproduces the issue.
You can reproduce this issue and the fix locally by just curling the
search endpoint like:
```bash
curl -H 'Accept: application/json' -XPOST -d '{"Q":"a.*"}' 'http://127.0.0.1:6070/api/search' -i
```
We already have two implementations, combinedNgramOffset and
binarySearchNgram, for the inverted index from ngrams to posting lists.
With the btree I am working on, a third implementation will follow soon.
The current apporach guarantees a zero-value that won't panic in the
tests, however, I don't think we should pile on top of that for the
third implementation, but rather handle nil in the code and return an
error where it makes sense.
with the `golang.org/x/sys/unix` package.
`syscall` has been frozen since Go 1.3 and deprecated
(https://go.dev/doc/go1.4#major_library_changes).
Using the `golang.org/x/sys/unix` package will bring in bug fixes
and enhancements since `syscall` was frozen in 1.3,
and will pave the way for multi-platform builds
(which will affect only the single-program local install, most likely).
Updated matching rules for tsx.
This is repentance for introducing this bug and others fixing it. This
adds a test to ensure we start up correctly if there are no shards to
load.
Test Plan: Revert the fix 54c8cf0c and check that test fails.
Our RPC system does not support shutdown and hijacks the underlying http
connection. This means shutdown is ineffective and just waits 10s before
calling close. Lets just quit faster in that case.
Test Plan: Run zoekt-webserver with and without rpc. Run kill and
observe behaviour.
The links were absolute and therefore broken. If you just run Zoekt by
itself that's fine, but in conjunction with Sourcegraph the links don't
work.
We have been running with eager runDocSection decoding on sourcegraph.com this
week and have seen improvements to tail latency and average latency. We
believe the lazy decoding was unnecessary since we so often do global symbol
searches that we would constantly be decoding doc section data all the time.
Co-authored-by: Sourcegraph <batch-changes@sourcegraph.com>
Full debug context here: https://sourcegraph.slack.com/archives/C023ELQLV7F/p1672835804462349
Commit 9899a9b3f475ef066ed70c395f8b303268f5d00c changed what the
response looks like when Zoekt has never loaded a shard: it now reports
a `Crashes = 1`, even if everything's fine.
That leads to upstream errors where we show an error message in the
Sourcegraph admin UI because the customer hasn't added any repositories
to their instance yet.
The fix here changes the `loader` to also mark the `shardedSearcher` as
ready if there was nothing to load. Previously the `markReady` was
skipped, the searcher wasn't marked as "ready" and every search query
was replied to with a `Crashes = 1`
The repos are now hidden by default. If visible they are rendered as
html table instead of buttons inside a form.
Currently taking up 90% of object allocations for sg. This commit
reduces that by 90%. The next step we can take is instead of
returning a map we use a slice.
name old time/op new time/op delta
RepoList_Encode-8 166µs ± 4% 84µs ± 3% -49.12% (p=0.000 n=10+10)
RepoList_Decode-8 519µs ± 2% 148µs ± 1% -71.46% (p=0.000 n=10+10)
name old bytes new bytes delta
RepoList_Encode-8 57.8kB ± 0% 51.1kB ± 0% -11.66% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
RepoList_Encode-8 4.06kB ± 0% 57.34kB ± 0% +1311.02% (p=0.000 n=10+10)
RepoList_Decode-8 316kB ± 0% 259kB ± 0% -17.96% (p=0.000 n=10+7)
name old allocs/op new allocs/op delta
RepoList_Encode-8 1.00k ± 0% 0.00k ± 0% -99.90% (p=0.000 n=10+10)
RepoList_Decode-8 6.59k ± 0% 0.59k ± 0% -91.09% (p=0.000 n=10+10)
Test Plan: go test
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
Co-authored-by: Camden Cheek <camden@ccheek.com>
We use two different versions, charset=utf-8 and charset=utf8, in the
code. The former seems to be the standard, hence we rename the latter.
We serve the hostname so that we can associate a repo with its
indexserver on the repo status page in Sourcegraph.
Now, reindex requests are handled asynchronously, which means the client
won't have to wait around for big repos to be indexed. This makes things
easier to handle once we call this endpoint from Sourcegraph.
Instead of failing hard, we simply log an move on. So far this
connection is only used for a nonessential feature ("reindex button").
A pattern within search.largeFiles can negate files which match an earlier pattern, which would exclude that file from being able to ignore/override the size max threshold.
* log: different msg depending on ctx error
* use errors.Is to check ctx errors
* lower log severity to warn
This is the final commit implementing a non-zero Crashes value while
shardedSearcher is startin up. This is meant to communicate when a zoekt
is partially available. When landed, Sourcegraph's APIs will communicate
that the user may have partial results and should retry.
Test Plan: go test. Additionally added a large sleep to loader.load to
simulate a slow startup. Then did searches in zoekt-webserver and
observed it returning non-zero crash counts until all shards loaded.
If this field is false we will communicate back to the client that we
are partially available. We piggy-back on the Crashes stat since that is
currently used by Sourcegraph to communicate back partial availability.
However, once we stabilize here we should introduce more fine grained
fields.
This commit just introduces reads to the field, the next commit will
introduce the logic which mutates this field which will actually change
behaviour.
Test Plan: go test
Fixes a regression around dev logging.
I want to introduce some extra state were the order we access it is
important, so we introduce a new structure loaded which represents the
final computed state. For now thiis is a non-functional change, we only
introduce the structure and update call sites. The next commit will add
a field.
Test Plan: go test
Want the SRC_LOG_SCOPE_LEVEL overrides.
Test Plan: go test ./...
This adds the option to trigger a reindex without rendering "/", which
is going to be useful if we want to trigger a reindex from Sourcegraph.
This enables the proxy for Sourcegraph's webserver
See #487
Even when a repo has ranking data, certain files will not have ranks, like
Markdown or yaml files. Currently these have rank 0, which puts them at a big
disadvantage and means they're usually ranked last.
This PR proposes to use the mean repo rank instead of 0. The rules:
* If we have a concrete rank for the file, always use it
* If there's no rank, and it's a low priority file like a test, then use rank 0
* Otherwise use the mean rank for the repository
We don't attempt to handle the case where an entire repo is missing ranks
because it doesn't have precise code intel.
The file score includes a "repo rank" signal, which is based on the
repository's number of stars. Previously, we were aggressively normalizing the
number of stars, which made the repo ranks small and close together. This PR
changes the normalization to spread it out better over the full range. This
increases its contribution to the score.
Before, the collector aggregated all file matches before flushing, when it
finally sorted and truncated them. Now we sort and limit while collecting
results, instead of at the end. This can help cut down on memory in the case
there are many shard results. (It won't help with `count: all` queries, where
MaxDocDisplayCount is not set.)
Another small benefit is that we obey FlushWallTime more closely. Before, we
might sort a very large number of results after already using up the whole time.
This changes how atom-score is calculated to better reflect the
complexity of the result or the user's intent.
Currently, the atom-score is the ratio "atomMatchCount/totalAtomAcount".
However, "totalAtomAcount" is based on the pruned match-tree which may
be very different from the user query. For example, a match-tree for the
query "foo or bar" is pruned to (a matchtree representing the query)
"foo" if the shard doesn't contain the trigram "bar". In an extreme
case, a query like "foo or bar or bas or qux" can receive the maximum
atom-score if a repo just contains matches for foo.
Assuming that the score of a match should reflect how close a match is
to the user's intent, we should rather base the atom-score on the
original unpruned match-tree. However, the original match-tree is
already based on a simplified query, (see "d.simplify(q)" in eval.go)
Hence I change the scoring function for atom-score to be based on the
absolute count and to assymptotically approach scoreFactorAtomMatch.
This also makes the score more comparable across shards.
This previously was only being used for storing thre repos. Since bare
repos take up much more disk space that the indexes and have different
IO requirements they need to be on a separate disk. We may later
introduce a more generic data_dir which can then set default values for
repo_dir and index_dir but for now it's simpler to make these
2 options explicit.
The use of a b-tree has proven to be successful. Webserver's heap for a production instance of Sourcegraph shrank by 45% without a noticable impact on latency. Hence we set the b-tree as default. It can be disabled by setting ZOEKT_DISABLE_BTREE=true for webserver.
At the same time we remove binarySearchNgram, because it uses more disk accesses than the b-tree and keeps a reference to the posting offsets.
Once the b-tree has proven itself over a longer period of time, I will remove the alternative code path (combinedNgramOffset) and clean up the temporary structs and interfaces I created.
* Replace unix/syscall `mmap` with `mmap-go`
`mmap-go` is a cross-platform memory-mapping package.
It uses `Mmap` on *Nix systems
and `CreateFileMapping` + `MapViewOfFile` on Windows.
`mmap-go` benchmarks the same as `unix.Mmap`
(see [https://github.com/peterguy/benchmark-mmap](https://github.com/peterguy/benchmark-mmap)).
BONUS! `indexfile_other.go` used straight file reads (`os.File::ReadAt`)
to support the same kind of behavior as mmap on non-*Nix platforms;
`mmap-go` uses Windows APIs to do real memory mapping on Windows,
which is faster by several magnitudes.
Also change the rounding/padding of the memory buffer being allocated.
It was hard-coded to round up to the nearest 4096 bytes
in order to page align for mmap,
but not all operating systems use 4k pages.
Instead of the hard coded value, use `os.Getpagesize()`.
* Build-constrain usages of `unix.Umask`
* build-constrain zoekt webserver for cross-platform
* Fix buffer size in Windows
and refacter to a method that calculates
the correct buffer size for the platform
* use gopsutil v3 for disk space metrics
instead of unix/syscall Statfs
* Build-constrain the Prometheus memory map metrics
Windows does not have the concept of a count of mapped memory areas.
Non-linux operating systems do not use the /sys pseudo-filesystem.
Constrain `mustRegisterMemoryMapMetrics` to linux-only builds.
Other platforms use a no-op function.
* cleanup
- go get -u -t ./..
- remove unused import
* Update cmd/zoekt-webserver/main.go
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
---------
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
With this change we expose "the maximum allowed priority" and "the minimum age
of the latest commit" as command line flags and ENVs.
At the same time we clean up the Server struct a bit and bundle all
fields related to shard merging in a dedicated options struct.
The motiviation is to make shard merging fully configurable before
rolling it out to all customers.
The btree still referenced the offsets to the posting lists which had a
big impact on the heap usage of webserver.
With this change we just hold on to the simple section pointing to the
offsets IE we read offsets lazily and therfore incur another disk
access.
I added a unit test for ngramIndex.Get, because there wasn't really a
good test so far that captured the expected size of a posting list.
In total, after this change, the retrieval of a posting list for a given
ngram via the btree requires 2 disk accesses (1 for getting the bucket
ngrams, 1 for getting the offset to the posting list).
If this turns out to be to costly with regards to latency, we should
consider storing the offset together with the ngrams in the bucket.
This PR proposes to simplify how match scores are combined with static ranking
signals like the file rank. Instead of using reciprocal rank fusion (RRF), we
stick with the same strategy we already use to combine ranking signals, a
weighted sum. In my experience, this is much easier to debug and tune compared
to RRF. We're in full control of the ranking signals, and can make sure they're
bounded + meaningful, so using a sum seems totally fine.
It also removes the index sorting based on file rank. From my testing, it
didn't really make a difference to improving result quality. Removing it also
opens up the possibility of storing the file ranks outside the immutable index
data. In the future, this would let us update the ranks more often, without
worrying about fully reindexing repositories.
Our in-memory map of ngrams to posting lists contributes about 33% to webserver's heap usage in a production setting. This replaces the current implementation with a B+-tree. We expect to reduce the memory footprint of the mapping by around 95% at the cost of slower search speeds. The change is behind a feature-flag. To activate, set ZOEKT_ENABLE_BTREE=true in ENV.
Previously this endpoint was always returning an empty response for
valid requests. This is because the last line failed to encode the
SearchResult as JSON. After some investigation I can see that this is
because we default the Progress fields to `-Inf`. Since these progress
fields only seemm to be applicable for streaming use cases that use RPC
and don't encode as JSON then it seemed like the best fix was to simply
omit them in the JSON response.
This change also adds error handling to the JSON encoding so that these
errors will be simpler to debug in future. A test is also included that
reproduces the issue.
You can reproduce this issue and the fix locally by just curling the
search endpoint like:
```bash
curl -H 'Accept: application/json' -XPOST -d '{"Q":"a.*"}' 'http://127.0.0.1:6070/api/search' -i
```
We already have two implementations, combinedNgramOffset and
binarySearchNgram, for the inverted index from ngrams to posting lists.
With the btree I am working on, a third implementation will follow soon.
The current apporach guarantees a zero-value that won't panic in the
tests, however, I don't think we should pile on top of that for the
third implementation, but rather handle nil in the code and return an
error where it makes sense.
with the `golang.org/x/sys/unix` package.
`syscall` has been frozen since Go 1.3 and deprecated
(https://go.dev/doc/go1.4#major_library_changes).
Using the `golang.org/x/sys/unix` package will bring in bug fixes
and enhancements since `syscall` was frozen in 1.3,
and will pave the way for multi-platform builds
(which will affect only the single-program local install, most likely).
We have been running with eager runDocSection decoding on sourcegraph.com this
week and have seen improvements to tail latency and average latency. We
believe the lazy decoding was unnecessary since we so often do global symbol
searches that we would constantly be decoding doc section data all the time.
Full debug context here: https://sourcegraph.slack.com/archives/C023ELQLV7F/p1672835804462349
Commit 9899a9b3f475ef066ed70c395f8b303268f5d00c changed what the
response looks like when Zoekt has never loaded a shard: it now reports
a `Crashes = 1`, even if everything's fine.
That leads to upstream errors where we show an error message in the
Sourcegraph admin UI because the customer hasn't added any repositories
to their instance yet.
The fix here changes the `loader` to also mark the `shardedSearcher` as
ready if there was nothing to load. Previously the `markReady` was
skipped, the searcher wasn't marked as "ready" and every search query
was replied to with a `Crashes = 1`
Currently taking up 90% of object allocations for sg. This commit
reduces that by 90%. The next step we can take is instead of
returning a map we use a slice.
name old time/op new time/op delta
RepoList_Encode-8 166µs ± 4% 84µs ± 3% -49.12% (p=0.000 n=10+10)
RepoList_Decode-8 519µs ± 2% 148µs ± 1% -71.46% (p=0.000 n=10+10)
name old bytes new bytes delta
RepoList_Encode-8 57.8kB ± 0% 51.1kB ± 0% -11.66% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
RepoList_Encode-8 4.06kB ± 0% 57.34kB ± 0% +1311.02% (p=0.000 n=10+10)
RepoList_Decode-8 316kB ± 0% 259kB ± 0% -17.96% (p=0.000 n=10+7)
name old allocs/op new allocs/op delta
RepoList_Encode-8 1.00k ± 0% 0.00k ± 0% -99.90% (p=0.000 n=10+10)
RepoList_Decode-8 6.59k ± 0% 0.59k ± 0% -91.09% (p=0.000 n=10+10)
Test Plan: go test
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
Co-authored-by: Camden Cheek <camden@ccheek.com>
This is the final commit implementing a non-zero Crashes value while
shardedSearcher is startin up. This is meant to communicate when a zoekt
is partially available. When landed, Sourcegraph's APIs will communicate
that the user may have partial results and should retry.
Test Plan: go test. Additionally added a large sleep to loader.load to
simulate a slow startup. Then did searches in zoekt-webserver and
observed it returning non-zero crash counts until all shards loaded.
If this field is false we will communicate back to the client that we
are partially available. We piggy-back on the Crashes stat since that is
currently used by Sourcegraph to communicate back partial availability.
However, once we stabilize here we should introduce more fine grained
fields.
This commit just introduces reads to the field, the next commit will
introduce the logic which mutates this field which will actually change
behaviour.
Test Plan: go test
I want to introduce some extra state were the order we access it is
important, so we introduce a new structure loaded which represents the
final computed state. For now thiis is a non-functional change, we only
introduce the structure and update call sites. The next commit will add
a field.
Test Plan: go test