Commits
Instead of checking stdout, we check the combined output. By default, git fetch writes to stderr.
Previously this was only populated by the web package for serving a
specific endpoint. We now populate it in all places we do aggregation.
Test Plan: updated unit tests. Additionally loaded zoekt-webserver then
added to the index a repository which required 2 shards to index. The
repository count shown only increased by 1.
I think this is our older ranking stuff.
Test Plan: go test
This would of caught the nil panic which was fixed in the last commit.
Test Plan: go test
We missed this path in the last PR, since we assumed we did an RPC to
get this information. This change has a slightly larger blast radius
because indexserver needs to read back the index file to find out the
index time. Previously it just used inputs to the index when informing
sourcegraph.
Test Plan: pretty robust unit tests here.
SG CI depends on the docker images being present. Now that our docker
images take longer to build with scip the race is easier to hit.
Test Plan: monitor main
We want this information in Sourcegraph. This turned into a bit more of
a hairy change due to our custom marshalling (for perf) and grpc
validation.
Test Plan: lots of unit test coverage
* Add scip-ctags to docker
* musl-ify
* Hardcode commit
* Pin dockerfile git commit
* Use archive
* Bump commit
This expands the debug output for keyword scoring to include the main elements of the score. This will help debug some surprising behavior where scores are quite different on dot com vs. S2.
This adds a gRPC API alongside the existing gob API.
It is enabled whenever the RPC setting is enabled. I didn't think it made sense to have a separate setting to enable it. It is only used when a gRPC request is detected (Content-Type: application/grpc). Eventually, we should likely open a separate port for gRPC traffic, but this should be okay for now.
In order to minimize the footprint of this change, we only use the protobuf definitions in the RPC layer. They are translated to/from the existing go types. This results in a small perf penalty.
* Add alternate ctags parser and language map
* Fix minor issues
* Lint protobuf
* Fix nil access
* Remove debug statements
* Small fixes
* ctagsAddSymbols -> ctagsAddSymbolsParser
* Split out parser logic
* Fix reviews
We never test or use this code path. I am also unaware of any users of
this code. If we hear complaints I'd recommend adding this code back in
a way that conforms to the parser interface.
Test Plan: go test
This PR adds an experimental option `UseKeywordScoring` for scoring file matches using [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/), the most common approach in keyword search. It treats each match in a file as a term and uses term frequencies to compute an approximation to BM25.
For now, it makes several simplifications:
* We drop inverse document frequency (idf) from the formula, since we don't have access to global term statistics
* There is no special handling for case sensitivity, filename matches, or symbols
* It ignores all other scoring signals, since BM25 is not normalized and it can be hard to combine it with other signals
The `indexData.Search` method is super long. This PR pulls most scoring logic
into its own function, which will make it easier to modify in future PRs.
Testing:
* All Zoekt tests still pass
* Ran example queries with/ without the change, and the search-debug output was
the same
This is to resolve CVE-2023-25652 and CVE-2023-29007.
Additionally we update the gitindex tests to specify the default branch
to use. This was to fix local failures I was experiencing.
Test Plan: go test inside of nix develop. Additionally ensure the
version of git was larger than 2.38.5 when testing.
Co-authored-by: davejrt <davetry@gmail.com>
This copies how we do it in the Sourcegraph repo, using the same lock
file to ensure we are using the same versions.
Test Plan: nix develop
Before this change case sensitive symbol searches of the form
\bLITERAL\b would result in the error message
found *zoekt.andMatchTree inside query.Symbol
The root cause is the symbol search code is pretty janky. It constructs
a matchtree and then pulls out either the regex matchtree or the substr
matchtree, then creates a specific symbol searcher for those. It feels
like it could be more generic rather than a bunch of hard to follow copy
pasta. For now this was a more straightforward fix than creating a word
search for symbols.
Test Plan: expanded the unit tests to cover word search more often.
Additionally run zoekt-webserver and checked that a search like the
following works now:
sym:\bnewMatchTree\b case:yes
This makes it so you can run submodule tests without first adjusting
your global environment to allow file clones. This was a somewhat recent
change in git to require opting in. Previously we would just set the
config in CI, but now we just always do it _just_ for that test run.
Test Plan: go test and CI
The search-core team has been renamed to search-team, resulting in the failure of the sync job.
tiny refactor to make the code more succint. Based on a comment in #562.
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
fixes a bug where we wouldn't display branches correctly if the query contained one or more branch filters.
Add support for parsing queries for zoekt.fork and zoekt.public.
I removed the code for binary search ngrams in #540, so this check is
not needed anymore.
Even when a repo has ranking data, certain files will not have ranks, like
Markdown or yaml files. Currently these have rank 0, which puts them at a big
disadvantage and means they're usually ranked last.
This PR proposes to use the mean repo rank instead of 0. The rules:
* If we have a concrete rank for the file, always use it
* If there's no rank, and it's a low priority file like a test, then use rank 0
* Otherwise use the mean rank for the repository
We don't attempt to handle the case where an entire repo is missing ranks
because it doesn't have precise code intel.
Co-authored-by: Camden Cheek <camden@ccheek.com>
This just links to the HTML source where there are already example
queries. This will help with discovery and linking people to docs where
they aren't running a local Zoekt instance.
We are updating the ranking service API to only return a single rank
representing the log of each file's reference counts. This PR adapts
zoekt-indexserver to the new API shape, and updates the normalization strategy.
For now, it doesn't change the on-disk format to keep the change simple.
The file score includes a "repo rank" signal, which is based on the
repository's number of stars. Previously, we were aggressively normalizing the
number of stars, which made the repo ranks small and close together. This PR
changes the normalization to spread it out better over the full range. This
increases its contribution to the score.
This adds an integration test for sharded search when document ranking is
enabled. It checks that results are combined correctly across multiple shards,
and that the ranked combined results are streamed out.
Before, the collector aggregated all file matches before flushing, when it
finally sorted and truncated them. Now we sort and limit while collecting
results, instead of at the end. This can help cut down on memory in the case
there are many shard results. (It won't help with `count: all` queries, where
MaxDocDisplayCount is not set.)
Another small benefit is that we obey FlushWallTime more closely. Before, we
might sort a very large number of results after already using up the whole time.
This changes how atom-score is calculated to better reflect the
complexity of the result or the user's intent.
Currently, the atom-score is the ratio "atomMatchCount/totalAtomAcount".
However, "totalAtomAcount" is based on the pruned match-tree which may
be very different from the user query. For example, a match-tree for the
query "foo or bar" is pruned to (a matchtree representing the query)
"foo" if the shard doesn't contain the trigram "bar". In an extreme
case, a query like "foo or bar or bas or qux" can receive the maximum
atom-score if a repo just contains matches for foo.
Assuming that the score of a match should reflect how close a match is
to the user's intent, we should rather base the atom-score on the
original unpruned match-tree. However, the original match-tree is
already based on a simplified query, (see "d.simplify(q)" in eval.go)
Hence I change the scoring function for atom-score to be based on the
absolute count and to assymptotically approach scoreFactorAtomMatch.
This also makes the score more comparable across shards.
This previously was only being used for storing thre repos. Since bare
repos take up much more disk space that the indexes and have different
IO requirements they need to be on a separate disk. We may later
introduce a more generic data_dir which can then set default values for
repo_dir and index_dir but for now it's simpler to make these
2 options explicit.
The use of a b-tree has proven to be successful. Webserver's heap for a production instance of Sourcegraph shrank by 45% without a noticable impact on latency. Hence we set the b-tree as default. It can be disabled by setting ZOEKT_DISABLE_BTREE=true for webserver.
At the same time we remove binarySearchNgram, because it uses more disk accesses than the b-tree and keeps a reference to the posting offsets.
Once the b-tree has proven itself over a longer period of time, I will remove the alternative code path (combinedNgramOffset) and clean up the temporary structs and interfaces I created.
Ensures the new "flush collect sender" is exercised in tests.
Previously this was only populated by the web package for serving a
specific endpoint. We now populate it in all places we do aggregation.
Test Plan: updated unit tests. Additionally loaded zoekt-webserver then
added to the index a repository which required 2 shards to index. The
repository count shown only increased by 1.
We missed this path in the last PR, since we assumed we did an RPC to
get this information. This change has a slightly larger blast radius
because indexserver needs to read back the index file to find out the
index time. Previously it just used inputs to the index when informing
sourcegraph.
Test Plan: pretty robust unit tests here.
This adds a gRPC API alongside the existing gob API.
It is enabled whenever the RPC setting is enabled. I didn't think it made sense to have a separate setting to enable it. It is only used when a gRPC request is detected (Content-Type: application/grpc). Eventually, we should likely open a separate port for gRPC traffic, but this should be okay for now.
In order to minimize the footprint of this change, we only use the protobuf definitions in the RPC layer. They are translated to/from the existing go types. This results in a small perf penalty.
This PR adds an experimental option `UseKeywordScoring` for scoring file matches using [BM25](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/), the most common approach in keyword search. It treats each match in a file as a term and uses term frequencies to compute an approximation to BM25.
For now, it makes several simplifications:
* We drop inverse document frequency (idf) from the formula, since we don't have access to global term statistics
* There is no special handling for case sensitivity, filename matches, or symbols
* It ignores all other scoring signals, since BM25 is not normalized and it can be hard to combine it with other signals
This is to resolve CVE-2023-25652 and CVE-2023-29007.
Additionally we update the gitindex tests to specify the default branch
to use. This was to fix local failures I was experiencing.
Test Plan: go test inside of nix develop. Additionally ensure the
version of git was larger than 2.38.5 when testing.
Before this change case sensitive symbol searches of the form
\bLITERAL\b would result in the error message
found *zoekt.andMatchTree inside query.Symbol
The root cause is the symbol search code is pretty janky. It constructs
a matchtree and then pulls out either the regex matchtree or the substr
matchtree, then creates a specific symbol searcher for those. It feels
like it could be more generic rather than a bunch of hard to follow copy
pasta. For now this was a more straightforward fix than creating a word
search for symbols.
Test Plan: expanded the unit tests to cover word search more often.
Additionally run zoekt-webserver and checked that a search like the
following works now:
sym:\bnewMatchTree\b case:yes
This makes it so you can run submodule tests without first adjusting
your global environment to allow file clones. This was a somewhat recent
change in git to require opting in. Previously we would just set the
config in CI, but now we just always do it _just_ for that test run.
Test Plan: go test and CI
Even when a repo has ranking data, certain files will not have ranks, like
Markdown or yaml files. Currently these have rank 0, which puts them at a big
disadvantage and means they're usually ranked last.
This PR proposes to use the mean repo rank instead of 0. The rules:
* If we have a concrete rank for the file, always use it
* If there's no rank, and it's a low priority file like a test, then use rank 0
* Otherwise use the mean rank for the repository
We don't attempt to handle the case where an entire repo is missing ranks
because it doesn't have precise code intel.
The file score includes a "repo rank" signal, which is based on the
repository's number of stars. Previously, we were aggressively normalizing the
number of stars, which made the repo ranks small and close together. This PR
changes the normalization to spread it out better over the full range. This
increases its contribution to the score.
Before, the collector aggregated all file matches before flushing, when it
finally sorted and truncated them. Now we sort and limit while collecting
results, instead of at the end. This can help cut down on memory in the case
there are many shard results. (It won't help with `count: all` queries, where
MaxDocDisplayCount is not set.)
Another small benefit is that we obey FlushWallTime more closely. Before, we
might sort a very large number of results after already using up the whole time.
This changes how atom-score is calculated to better reflect the
complexity of the result or the user's intent.
Currently, the atom-score is the ratio "atomMatchCount/totalAtomAcount".
However, "totalAtomAcount" is based on the pruned match-tree which may
be very different from the user query. For example, a match-tree for the
query "foo or bar" is pruned to (a matchtree representing the query)
"foo" if the shard doesn't contain the trigram "bar". In an extreme
case, a query like "foo or bar or bas or qux" can receive the maximum
atom-score if a repo just contains matches for foo.
Assuming that the score of a match should reflect how close a match is
to the user's intent, we should rather base the atom-score on the
original unpruned match-tree. However, the original match-tree is
already based on a simplified query, (see "d.simplify(q)" in eval.go)
Hence I change the scoring function for atom-score to be based on the
absolute count and to assymptotically approach scoreFactorAtomMatch.
This also makes the score more comparable across shards.
This previously was only being used for storing thre repos. Since bare
repos take up much more disk space that the indexes and have different
IO requirements they need to be on a separate disk. We may later
introduce a more generic data_dir which can then set default values for
repo_dir and index_dir but for now it's simpler to make these
2 options explicit.
The use of a b-tree has proven to be successful. Webserver's heap for a production instance of Sourcegraph shrank by 45% without a noticable impact on latency. Hence we set the b-tree as default. It can be disabled by setting ZOEKT_DISABLE_BTREE=true for webserver.
At the same time we remove binarySearchNgram, because it uses more disk accesses than the b-tree and keeps a reference to the posting offsets.
Once the b-tree has proven itself over a longer period of time, I will remove the alternative code path (combinedNgramOffset) and clean up the temporary structs and interfaces I created.