Commits
The /a/ prefix is required for authenticated git operations in default Gerrit
configurations. Gerrit's HttpScheme intentionally includes /a/ in clone URLs
to trigger authentication. Removing it breaks instances using default config.
The ServerInfo API already returns the correct URL format based on the
instance's configuration, so we should use it as-is.
Amp-Thread-ID: https://ampcode.com/threads/T-6944ab20-837d-4e78-a9ad-51083263baec
Co-authored-by: Amp <amp@ampcode.com>
Co-authored-by: John Mason <jmason@gitlab.com>
* Fix git.Open() fs argument for .git dir instead of worktree
The filesystem argument submitted to git.Open() is of the .git
directory, it should be for the worktree, as in go-git this logic is copied from:
https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375
One side effect of this bug was that go-git would refuse to fetch repo
submodules
* Fix missing RepoURLs and LineFragments for result subrepositories
RepoURLs for subrepositories are currently filtered out when results
are streamed per-repo from shards.
* Add submodule indexing without repo cache
When not using repo cache, index git submodules recursively into
subrepos using the go-git API
* Adjust subrepo naming conventions
- Revert repo-cache behavior to unchanged
- Add support for naming repos and modules with no remote (use the dirname for the
former and subrepopath for the latter)
- Add test for submodule no-repo-cache behavior
* Fix overallocation in repoURL and lineFragment filtering
* Revert basename treatment from 08f67597
Missed the fact that it is already done in zoekt-git-index
command, the test was improperly configured
https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100
Co-authored-by: ARyzhikov <ARyzhikov@nota.tech>
We want Zoekt and Sourcegraph to use the same language package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it.
Notes:
- Zoekt doesn't need to fetch content async which is why I added a little helper func `GetLanguagesFromContent` to make the call sites in Zoekt less awkward.
- Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from `unsupportedByEnryExtensionToNameMap`. I added an additional unit test to cover this case.
Test plan:
I appended the test cases from the old Zoekt languages packages to the tests I copied over from Sourcegraph
In https://github.com/sourcegraph/zoekt/pull/962 a new field metadata
was added to zoekt.Repository, but it was not added to the grpc
types and converter methods, which is why the fuzz test occasionally
fails.
Test plan:
CI
This was part of an effort to move the queue from Zoekt to Sourcegraph.
However, we are not going to pursue this for now and we can remove
the corresponging grpc methods.
Sourcegraph never had callers of Index and Delete, so both are save to remove.
`DeleteAllData` is still being called from Sourcegraph.
Test plan:
CI
e2e tests were failing because github.com/sourcegraph/cody repository has been set to private and all tags were removed
Test plan:
CI
Currently if you switch to universal-ctags 6.2.0 it will log a few
warnings on startup. The old version of go-ctags would error out. Now
instead it will log to its info log.
Test Plan: Added the problematic file in the linked issue to ./foo/repo
directory. Then ran "go run ./cmd/zoekt-index -require_ctags -index
./foo/index ./foo/repo" with uctags 6.2.0.
This fixes https://github.com/sourcegraph/zoekt/issues/967.
I reported that also here: https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/issues/91
Previously we silently ignored it. For example a search like `(foo))`
would just work since we would only parse `(foo)`. We now report back to
the user the problem.
Note: we already had special handling for unbalanced parenthesis like
this `((foo)`.
Test Plan: added more test cases. Especially tried to explore areas we
could validly not consume the whole input but couldn't find one.
This was a leftover from the commit merged yesterday for code that was
removed.
Test Plan: go test ./...
At GitLab, we encountered limitations when searching within large namespaces
containing thousands of repositories. Specifically, we cannot pass a complete
list of RepoIDs due to size constraints.
This change introduces support for indexing and searching on custom repository
metadata by extending Repository to include an additional Metadata field.
All fields within Repository.Metadata are searchable using a regular
expression evaluator.
This enables more scalable filtering by allowing clients to express regular
expression prefix queries on metadata fields, such as:
traversal_ids:123-456-.*
Or any field really:
haystack:nee.*le
github.com/xanzy/go-gitlab has been archived and is now maintained by
gitlab. Update the import path and catch-up with the API changes.
The only place this was called outside of the tenant package was for
deciding if we should call tenant.Log. However, we can inline that check
and both simplify the tenant exported API as well as how you call
tenant.Log.
Test Plan: CI
We only want to use ID based shard names on multi-tenant instances.
Before this change we decided this based on if tenant enforcement was
turned on. However, we would like to always have tenant enforcment on at
Sourcegraph but not migrate all shards to the new naming scheme (just
yet).
This change is quite simple. However, it works since we never actually
read the filenames to extract tenant ID or repository ID. We only ever
use the data that is persisted inside of the shard to do the
enforcement.
The environment variable we read (WORKSPACES_API_URL) is the same one we
use in the Sourcegraph codebase to determine multi-tenant specific code.
Test Plan: Added a unit test. Will do a much larger manual E2E test
with the Sourcegraph project.
It had exactly one call site and the logic was super simple. So I think
it is clearer to just inline the logic here.
Test Plan: mechanical refactor so just CI. In a future PR I will be
adding better test coverage just on this function.
These are duplicated with RepositoryDescription. There is exactly one
place we set these values, and in that case we set it in the repository
description as well.
Note: we update some the calls in builder.go for o.IndexOptions.RepoID
to be just o.RepoID. This is to make it consistent with how we access
TenantID. IndexOptions is embedded in that struct (indexArgs) which is
why we can do the shortcut.
Test Plan: go test is sufficient for this change. I will do a larger
round of manual tests soon with all my changes stacked.
Asked an agent to inspect the codebase and docs and update the docs to
match reality more. I ended up not using all of its suggestions, but
these are the ones that seem reasonable.
Test Plan: n/a
We have two places with duplicated logic around how it decides the layout of
shards on disk. This now moves that decision into one place.
Additionally we can now unexport index.ShardName. It was only used in one
place outside the package, and that was easy to replace with a hardcoded
string since it is just a test.
Test Plan: Just CI. This has no actual change in functionality, just
refactoring.
We no longer consume these images so lets stop building them.
Additionally we recently rotated some credentials and this step broke.
Rather than fixing it I think we should just remove it since we don't
use the artifacts anymore.
Test Plan: watch CI on main
Generated this and refined it to make sense. Hopefully this helps.
Test Plan: n/a
We instead rely on the environment which is inheritted to make this
decision. The logic will become a little different in the next PR which
will instead decide layout based on if we are in "workspaces" mode vs
just wanting to enforce tenants.
Note: this is Sourcegraph specific. We want to move to always enforcing
tenant in our environments, even outside of our multitenant
environments.
Test Plan: go test
Hi zoekt-Team,
when testing with our local instance I noticed that it seemed that delta builds where not correctly indexing only the delta that I expected, but much more. I looked at prepareDeltaBuild() and prepareNormalBuild() and found that prepareNormalBuild() expands the branches before processing them with expandBranches(), but prepareDeltaBuild() does not. This leads to a lot more indexing, because prepareDeltaBuild() can't find any known branches (that match the wildcard).
I added a call of expandBranches() to prepareDeltaBuild() to fix this and also added a test.
Regards,
Daniel
This fixes an annoying problem where after running `go test ./...` locally,
your local git `user.name` would be set to `thomas`.
This adds the before and after Lines in the search results. The default is
still 0 so the user has to increase "context lines" to see more.
Somehow I just found out about `gen-proto.sh` recently, after many months of
working on Zoekt! This PR adds brief `README` files in the proto directories to
guide people (and LLMs) to it.
Small refactor to simplify the logic in `trace.New`. It inlines a couple public
methods that don't need to be exposed separately.
This PR re-enables the go-git optimization that was disabled in https://github.com/sourcegraph/zoekt/pull/870 because it caused a huge bug. This PR removes the part of the optimization that was causing the bug, setting `LargeObjectThreshold`.
An alternative would be to track down the root cause of the bug in go-git, so that we could keep setting `LargeObjectThreshold`. However, I don't feel that's worth it as:
* The memory improvement from this change was pretty small (see https://github.com/sourcegraph/zoekt/pull/854)
* Eventually we want to move away from go-git towards a "vanilla" direct git usage
Make the flag "token" optional. Exit if the user provided token does not
exist (as before). Only, fallback to the token from the default location
if the user did not provide a token. Continue with an unauthenticated
GitHub client (`nil` OAuth client) if the fallback did not work (does
not exist, cannot be read, ...).
This PR fixes a bug where Zoekt would not compile on 32-bit architectures. It
also takes the opportunity to start using the `math` library everywhere instead
of our own constants like `maxUInt32` to help prevent this sort of issue in the
future by encouraging devs to select the most accurate "max" type for their
specific situation.
Closes https://github.com/sourcegraph/zoekt/issues/935
Closes SPLF-913
This only affects Sourcegraph.
This adds Index and Delete methods to
sourcegraph-indexserver's GRPC server.
The methods aren't called yet by Sourcegraph.
Notable differences to our current indexing loop:
- IndexOptions are pushed by the client instead of pulled by Zoekt
- A semaphore restricts concurrency
- New ENV `SRC_STOP_INDEXING` stops the "competing" indexing loop. This is just for
testing and will eventually be removed
- The Index method tries to recover a possibly older index from
`indexDir/.trash` first before starting to index
Review:
I recommend reviewing the `.proto` file first, followed by `main.go`
Test plan:
- I did a bunch of manual testing to test the trash recovery and the
semaphore logic. I successfuly triggered index and delete calls to the
GRPC server. I will put the Insomnia collection I used for testing in the ticket.
- New unit tests
In https://github.com/sourcegraph/zoekt/pull/901 I moved several packages to 'internal' to clean up the exported API. This PR moves the `shards` package back to root, since it contains important methods like `NewDirectorySearcher`. It also renames the `shards` package to `search` to clarify the usage.
Relates to https://github.com/sourcegraph/zoekt/pull/901#issuecomment-2703171432, https://github.com/sourcegraph/zoekt/discussions/927
The gatherBranches() function shifts a mask of type uint64 until it is 0, but
uses an id for accessing a map of type uint32. This results in empty branch
names getting returned from this function. It can be fixed by changing the type
of id to also be uint64.
The latest release of gopls has a feature called modernize which will
update your code where it can to use modern go features/pkgs.
https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0
Generated with:
go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./...
Test Plan: CI
In testing, I noticed another problem with BM25: sometimes a binary file is ranked highly because of a match on its filename. In classic Zoekt scoring, these are ranked low because they are skipped, and we always sort skipped docs to the end of the index.
This PR ensures they're also ranked low for BM25 by adding a 'binary' category, and marking it low priority. Adding this category required updating `SkipReason` to track the reason a document was skipped. This is necessary because we set the content of skipped docs to `nil`, and `SkipReason` is the only lasting indication that it was binary.
Relates to #920
Relates to SPLF-874
This implements DeleteAllData. We hold the global lock while deleting all simple shards belonging to a tenant. We also handle compound shards by disassembling them first.
Note that this "only" deletes persisted data. Updating the queue, for example, seems fragile because it might immediately get updated by Sourcegraph. This implies that Sourcegraph first has to delete the tenant in the Sourcegraph DB first and then call this new endpoint. Even if the queue still has a reference to a deleted tenant, indexserver won't be able to retrieve index options or clone the repo from gitserver.
Test plan:
- new unit tests
- manual testing:
I ran this together with Sourcegraph and triggered a delete by calling DeleteAllData directly. I confirmed that all shards, including compound shards are deleted.
This PR reworks the way we incorporate file signals into BM25. Previously, we were applying them as a tie-breaker. But in dogfooding, we found that these rarely impact results, because it's so rare to have a tie in BM25 scores.
Now, we take the file signal into account when computing BM25F. The interpretation is that this data lives in a separate 'field' that is half the priority of regular content.
Relates to SPLF-874
This adds a grpc server to sourcegraph-indexserver. For now it supports just one method.
The diff is quite big, so I left comments to mark the most important bits.
I used the opportunity to clean up a bit (=> hence the big diff):
- Reuse grpc logic from webserver and move those bits to "/gprc/..."
- Move "protos" inside the new "grpc" directory (=> requires changes of import statements in Sourcegraph)
- Refactor import aliases for grpc packages across the codebase
Test plan:
I tested this locally by calling the new grpc endpoint directly.
This was left over from our PageRank prototype.
With this change we recognize boosted queries in our bm25 scoring and
adjust the overall score accordingly.
We need to take care of 2 parts: The overall bm25 score of the document,
and the line score determining the order in which we return the chunks.
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase.
The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind.
It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers.
Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards.
Added a benchmark to check the impact. See comments in the code.
---------
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This adds repo freshness and file order as tiebreakers to the final
bm25 score, just like we have for Zoekt's default scoring.
During testing I found that it is a lot less likely for the tiebreakers
to have an effect with BM25 because the score depends on qualites of the
document, such as the relative length and number of matches, which
usually differ even with the quality of the match is similar.
Test plan:
- Score tests still pass
- manual testing: see screenshots
We remove IDF from our BM25 scoring, effectively treating it as constant.
This is supported by our evaluations which showed that for keyword style queries, IDF can down-weight the score of important keywords too much, leading to a worse ranking. The intuition is that for code search, each keyword is important independently of how frequent it appears in the corpus.
Removing IDF allows us to apply BM25 scoring to a wider range of query types. Previously, BM25 was limited to queries with individual terms combined using OR, as IDF was calculated on the fly at query time.
Test plan:
updated tests
Fixes a bug I introduced in #891.
When skipping a doc, we currently report the detected language as "binary" (if
it looks like binary) or "skipped" (if it's skipped for any other reason).
Skipped docs are still added to the index and can still be returned as search
results, for example if you only match on filename. So sometimes file matches
are returned with "skipped" as their language, even though the file path is
clearly some other language like XML.
This PR updates the indexing logic to still detect the language even if the
document is skipped. However, we avoid passing the contents to the language
detection library to avoid running detection on huge files.
---------
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
We added support for Windows in https://github.com/sourcegraph/zoekt/pull/535. We then partially reverted the change, specifically the parts related to mmap (#706). This was okay because we no longer build for Windows.
This PR fully removes support to avoid being in a partially-implemented state.
When navigating the code, I've often forgotten the difference between
`NewBuilder` and `NewIndexBuilder`. This rename clarifies that one of these
indexes a whole repo, while the other builds individual shards. Also
`index.NewShardBuilder` sounds better.
Follow up to #905. I realized the README looked much better with a title!
This PR updates the README to clarify Zoekt's current design, and explain the main usage patterns.
The /a/ prefix is required for authenticated git operations in default Gerrit
configurations. Gerrit's HttpScheme intentionally includes /a/ in clone URLs
to trigger authentication. Removing it breaks instances using default config.
The ServerInfo API already returns the correct URL format based on the
instance's configuration, so we should use it as-is.
Amp-Thread-ID: https://ampcode.com/threads/T-6944ab20-837d-4e78-a9ad-51083263baec
Co-authored-by: Amp <amp@ampcode.com>
* Fix git.Open() fs argument for .git dir instead of worktree
The filesystem argument submitted to git.Open() is of the .git
directory, it should be for the worktree, as in go-git this logic is copied from:
https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375
One side effect of this bug was that go-git would refuse to fetch repo
submodules
* Fix missing RepoURLs and LineFragments for result subrepositories
RepoURLs for subrepositories are currently filtered out when results
are streamed per-repo from shards.
* Add submodule indexing without repo cache
When not using repo cache, index git submodules recursively into
subrepos using the go-git API
* Adjust subrepo naming conventions
- Revert repo-cache behavior to unchanged
- Add support for naming repos and modules with no remote (use the dirname for the
former and subrepopath for the latter)
- Add test for submodule no-repo-cache behavior
* Fix overallocation in repoURL and lineFragment filtering
* Revert basename treatment from 08f67597
Missed the fact that it is already done in zoekt-git-index
command, the test was improperly configured
https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100
We want Zoekt and Sourcegraph to use the same language package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it.
Notes:
- Zoekt doesn't need to fetch content async which is why I added a little helper func `GetLanguagesFromContent` to make the call sites in Zoekt less awkward.
- Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from `unsupportedByEnryExtensionToNameMap`. I added an additional unit test to cover this case.
Test plan:
I appended the test cases from the old Zoekt languages packages to the tests I copied over from Sourcegraph
This was part of an effort to move the queue from Zoekt to Sourcegraph.
However, we are not going to pursue this for now and we can remove
the corresponging grpc methods.
Sourcegraph never had callers of Index and Delete, so both are save to remove.
`DeleteAllData` is still being called from Sourcegraph.
Test plan:
CI
Currently if you switch to universal-ctags 6.2.0 it will log a few
warnings on startup. The old version of go-ctags would error out. Now
instead it will log to its info log.
Test Plan: Added the problematic file in the linked issue to ./foo/repo
directory. Then ran "go run ./cmd/zoekt-index -require_ctags -index
./foo/index ./foo/repo" with uctags 6.2.0.
Previously we silently ignored it. For example a search like `(foo))`
would just work since we would only parse `(foo)`. We now report back to
the user the problem.
Note: we already had special handling for unbalanced parenthesis like
this `((foo)`.
Test Plan: added more test cases. Especially tried to explore areas we
could validly not consume the whole input but couldn't find one.
At GitLab, we encountered limitations when searching within large namespaces
containing thousands of repositories. Specifically, we cannot pass a complete
list of RepoIDs due to size constraints.
This change introduces support for indexing and searching on custom repository
metadata by extending Repository to include an additional Metadata field.
All fields within Repository.Metadata are searchable using a regular
expression evaluator.
This enables more scalable filtering by allowing clients to express regular
expression prefix queries on metadata fields, such as:
traversal_ids:123-456-.*
Or any field really:
haystack:nee.*le
We only want to use ID based shard names on multi-tenant instances.
Before this change we decided this based on if tenant enforcement was
turned on. However, we would like to always have tenant enforcment on at
Sourcegraph but not migrate all shards to the new naming scheme (just
yet).
This change is quite simple. However, it works since we never actually
read the filenames to extract tenant ID or repository ID. We only ever
use the data that is persisted inside of the shard to do the
enforcement.
The environment variable we read (WORKSPACES_API_URL) is the same one we
use in the Sourcegraph codebase to determine multi-tenant specific code.
Test Plan: Added a unit test. Will do a much larger manual E2E test
with the Sourcegraph project.
These are duplicated with RepositoryDescription. There is exactly one
place we set these values, and in that case we set it in the repository
description as well.
Note: we update some the calls in builder.go for o.IndexOptions.RepoID
to be just o.RepoID. This is to make it consistent with how we access
TenantID. IndexOptions is embedded in that struct (indexArgs) which is
why we can do the shortcut.
Test Plan: go test is sufficient for this change. I will do a larger
round of manual tests soon with all my changes stacked.
We have two places with duplicated logic around how it decides the layout of
shards on disk. This now moves that decision into one place.
Additionally we can now unexport index.ShardName. It was only used in one
place outside the package, and that was easy to replace with a hardcoded
string since it is just a test.
Test Plan: Just CI. This has no actual change in functionality, just
refactoring.
We instead rely on the environment which is inheritted to make this
decision. The logic will become a little different in the next PR which
will instead decide layout based on if we are in "workspaces" mode vs
just wanting to enforce tenants.
Note: this is Sourcegraph specific. We want to move to always enforcing
tenant in our environments, even outside of our multitenant
environments.
Test Plan: go test
Hi zoekt-Team,
when testing with our local instance I noticed that it seemed that delta builds where not correctly indexing only the delta that I expected, but much more. I looked at prepareDeltaBuild() and prepareNormalBuild() and found that prepareNormalBuild() expands the branches before processing them with expandBranches(), but prepareDeltaBuild() does not. This leads to a lot more indexing, because prepareDeltaBuild() can't find any known branches (that match the wildcard).
I added a call of expandBranches() to prepareDeltaBuild() to fix this and also added a test.
Regards,
Daniel
This PR re-enables the go-git optimization that was disabled in https://github.com/sourcegraph/zoekt/pull/870 because it caused a huge bug. This PR removes the part of the optimization that was causing the bug, setting `LargeObjectThreshold`.
An alternative would be to track down the root cause of the bug in go-git, so that we could keep setting `LargeObjectThreshold`. However, I don't feel that's worth it as:
* The memory improvement from this change was pretty small (see https://github.com/sourcegraph/zoekt/pull/854)
* Eventually we want to move away from go-git towards a "vanilla" direct git usage
Make the flag "token" optional. Exit if the user provided token does not
exist (as before). Only, fallback to the token from the default location
if the user did not provide a token. Continue with an unauthenticated
GitHub client (`nil` OAuth client) if the fallback did not work (does
not exist, cannot be read, ...).
This PR fixes a bug where Zoekt would not compile on 32-bit architectures. It
also takes the opportunity to start using the `math` library everywhere instead
of our own constants like `maxUInt32` to help prevent this sort of issue in the
future by encouraging devs to select the most accurate "max" type for their
specific situation.
Closes https://github.com/sourcegraph/zoekt/issues/935
Closes SPLF-913
This only affects Sourcegraph.
This adds Index and Delete methods to
sourcegraph-indexserver's GRPC server.
The methods aren't called yet by Sourcegraph.
Notable differences to our current indexing loop:
- IndexOptions are pushed by the client instead of pulled by Zoekt
- A semaphore restricts concurrency
- New ENV `SRC_STOP_INDEXING` stops the "competing" indexing loop. This is just for
testing and will eventually be removed
- The Index method tries to recover a possibly older index from
`indexDir/.trash` first before starting to index
Review:
I recommend reviewing the `.proto` file first, followed by `main.go`
Test plan:
- I did a bunch of manual testing to test the trash recovery and the
semaphore logic. I successfuly triggered index and delete calls to the
GRPC server. I will put the Insomnia collection I used for testing in the ticket.
- New unit tests
In https://github.com/sourcegraph/zoekt/pull/901 I moved several packages to 'internal' to clean up the exported API. This PR moves the `shards` package back to root, since it contains important methods like `NewDirectorySearcher`. It also renames the `shards` package to `search` to clarify the usage.
Relates to https://github.com/sourcegraph/zoekt/pull/901#issuecomment-2703171432, https://github.com/sourcegraph/zoekt/discussions/927
The latest release of gopls has a feature called modernize which will
update your code where it can to use modern go features/pkgs.
https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0
Generated with:
go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./...
Test Plan: CI
In testing, I noticed another problem with BM25: sometimes a binary file is ranked highly because of a match on its filename. In classic Zoekt scoring, these are ranked low because they are skipped, and we always sort skipped docs to the end of the index.
This PR ensures they're also ranked low for BM25 by adding a 'binary' category, and marking it low priority. Adding this category required updating `SkipReason` to track the reason a document was skipped. This is necessary because we set the content of skipped docs to `nil`, and `SkipReason` is the only lasting indication that it was binary.
Relates to #920
Relates to SPLF-874
This implements DeleteAllData. We hold the global lock while deleting all simple shards belonging to a tenant. We also handle compound shards by disassembling them first.
Note that this "only" deletes persisted data. Updating the queue, for example, seems fragile because it might immediately get updated by Sourcegraph. This implies that Sourcegraph first has to delete the tenant in the Sourcegraph DB first and then call this new endpoint. Even if the queue still has a reference to a deleted tenant, indexserver won't be able to retrieve index options or clone the repo from gitserver.
Test plan:
- new unit tests
- manual testing:
I ran this together with Sourcegraph and triggered a delete by calling DeleteAllData directly. I confirmed that all shards, including compound shards are deleted.
This PR reworks the way we incorporate file signals into BM25. Previously, we were applying them as a tie-breaker. But in dogfooding, we found that these rarely impact results, because it's so rare to have a tie in BM25 scores.
Now, we take the file signal into account when computing BM25F. The interpretation is that this data lives in a separate 'field' that is half the priority of regular content.
Relates to SPLF-874
This adds a grpc server to sourcegraph-indexserver. For now it supports just one method.
The diff is quite big, so I left comments to mark the most important bits.
I used the opportunity to clean up a bit (=> hence the big diff):
- Reuse grpc logic from webserver and move those bits to "/gprc/..."
- Move "protos" inside the new "grpc" directory (=> requires changes of import statements in Sourcegraph)
- Refactor import aliases for grpc packages across the codebase
Test plan:
I tested this locally by calling the new grpc endpoint directly.
With this change we recognize boosted queries in our bm25 scoring and
adjust the overall score accordingly.
We need to take care of 2 parts: The overall bm25 score of the document,
and the line score determining the order in which we return the chunks.
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase.
The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind.
It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers.
Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards.
Added a benchmark to check the impact. See comments in the code.
---------
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This adds repo freshness and file order as tiebreakers to the final
bm25 score, just like we have for Zoekt's default scoring.
During testing I found that it is a lot less likely for the tiebreakers
to have an effect with BM25 because the score depends on qualites of the
document, such as the relative length and number of matches, which
usually differ even with the quality of the match is similar.
Test plan:
- Score tests still pass
- manual testing: see screenshots
We remove IDF from our BM25 scoring, effectively treating it as constant.
This is supported by our evaluations which showed that for keyword style queries, IDF can down-weight the score of important keywords too much, leading to a worse ranking. The intuition is that for code search, each keyword is important independently of how frequent it appears in the corpus.
Removing IDF allows us to apply BM25 scoring to a wider range of query types. Previously, BM25 was limited to queries with individual terms combined using OR, as IDF was calculated on the fly at query time.
Test plan:
updated tests
When skipping a doc, we currently report the detected language as "binary" (if
it looks like binary) or "skipped" (if it's skipped for any other reason).
Skipped docs are still added to the index and can still be returned as search
results, for example if you only match on filename. So sometimes file matches
are returned with "skipped" as their language, even though the file path is
clearly some other language like XML.
This PR updates the indexing logic to still detect the language even if the
document is skipped. However, we avoid passing the contents to the language
detection library to avoid running detection on huge files.
---------
Co-authored-by: Julie Tibshirani <julietibs@apache.org>