Commits
This adds a new test and a new repo to our e2e ranking suite. The idea is to have a query for which two repos (here golang/go and conc) of different freshness compete for the best match.
Among other fixes, Go 1.22.6 fixes a code signing issue on macOS and
XCode 16 that prevents Zoetk from running
This PR adds adds a debugging flag to periodically check memory usage against a
threshold. If it exceeds the threshold, then a memory profile like
`indexmemory.prof.1` is written to disk. No more than 10 profiles will be
written.
I've already found this more useful than the existing `-memprofile` flag, so I
removed that. It's hard to get insights using that flag, since it only takes a
single profile per shard, forces GC, and forces parallelism to 1.
While working on scoring I noticed that public repos always have an
advantage compared to private repos, because we have
`doc(rawConfig:RcOnlyPublic|...)` as part of the matchtree, which
contributues +1 to the atom count.
Test plan:
New unit test
Tiny follow up to https://github.com/sourcegraph/zoekt/pull/826. I resolved a conflict incorrectly and reverted a log line improvement.
This is a much more robust detection mechanism. Additionally we have
these signals we can also add in:
func IsConfiguration(path string) bool
func IsDocumentation(path string) bool
func IsDotFile(path string) bool
func IsImage(path string) bool
My main concern with this change is generated file detection on content
using up RAM or CPU. Will monitor this impact on pprof in production.
Test Plan: go test.
Test Plan: go run ./cmd/zoekt-index prints out useful instructions
Looking at heap profiles, the `ReadMetadata` function creates a ton of garbage
objects. The main contributor is in other sections from the TOC, specifically
decoding `compoundSection.offsets` . However, to read metadata, we only really
need to parse the metadata sections.
This PR introduces a `skip` method that skips over a section without reading
it. This greatly reduces the allocations from `ReadMetadata`.
Small improvement to how we handle unknown or malformed sections when reading the TOC.
Note: this PR originally removed some of the "skip section" handling. That has been restored, since it is important for downgrades.
Tiny clean-up, as we rarely use Datadog profiling (and I've never heard of us setting this env var).
This PR initializes the GCP profiler in the `zoekt-git-index` process so we can
examine CPU and memory usage for the indexing process itself.
Inspired by a recent Sourcegraph refactor.
We never index files over 1MB, unless the "LargeFiles" allowlist is set. So in
most cases, we can avoid fetching them at all.
This PR updates the `git fetch` to filter out files over 1MB when possible, and
exclude tags. It also refactors the very long `gitIndex` method.
Test Plan: trivy fs go.mod runs clean
The new metric `index_indexing_delay_seconds` is looking good on dot com and
producing more interpretable results. This PR removes the old metric
`index_queue_age_seconds`. And it increases the bucket range for the new
metric, since we sometimes exceed the current 2 day limit.
`Queue` is always created through `NewQueue`, which always initializes the data
structures. So we don't need to initialize them in other methods.
The `sourcegraph/sourcegraph` repo is now private, so we switch to
`sourcegraph-public-snapshot` for these tests. Eventually, we'll move the
relevant tests to the sourcegraph repo itself.
Our dashboards currently report indexing delay using the
`index_queue_age_seconds` metric, which tracks the duration indexing jobs wait
in the queue before being "popped" for indexing. For dot com, we regularly see
the median for this metric at 2 days. However, this metric may be misleading
because it includes "noop" indexing jobs where the index is already up to date.
These jobs can stay in the queue for a long time since they are not high
priority.
This PR adds a new metric `index_indexing_delay_seconds`, which reports the
duration from when an indexing job enters the queue, to when it completes. It
uses 'state' as a label, so by setting `state='success'` we can (roughly) see
the delay from when Zoekt learns about a new version, to when that version is
available to search.
If `maxFreq=0` then `rand.Intn` will panic. This PR ensures it is always >= 1.
Example failure: https://github.com/sourcegraph/zoekt/actions/runs/10301894396/job/28514416201?pr=804
This is an optional dependency that I don't believe OSS users of docker
use. We used to use these dockerfiles in Sourcegraph, but have since
moved to building it in our own CI with scip-ctags.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
Remove the dependency on the sourcegraph monorepo
Any write to the index dir triggered a scan. This means on busy
instances we are constantly rescanning, leading to an
over-representation in CPU profiles around watch. The events are
normally writes to our temporary files. By only considering events for
.zoekt files (which is what scan reads) we can avoid the constant scan
calls.
Just in case we also introduce a re-scan every minute in case we miss an
event. There is error handling around this, but I thought it is just
more reliable to call scan every once in a while.
Note: this doesn't represent significant CPU use, but it does muddy the
CPU profiler output. So this makes it easier to understand trends in our
continuous cpu profiling.
Test Plan: CI
This enables shard merging by default for zoekt-sourcegraph-indexserver.
Sourcegraph has been using shard merging in production for several years. We have recently confirmed significant performance improvements for queries which are bound by matchTree construction.
I also remove -merge_max_priority because we have stopped using it.
Use SRC_DISABLE_SHARD_MERGING to disable shard merging.
Test plan:
mostly CI, I did some manual testing to confirm that shard merging is enabled by default for zoekt-sourcegraph-indexserver.
* feat(Search): Add support for all Apex language extensions
* clean up comment
* Fix typo
After defaulting to shard merging for all inactive repos, this in fact
makes searches slightly slower. So we can remove the experiment.
Test Plan: go test
The first bit of data I am getting back indicates this strategy of
limiting the number of ngrams we lookup isn't working. I am still
experimenting with different limits, but in the meantime it is easy to
implement a strategy which picks a random subset. This is so that the
first N ngrams of a query aren't the only ones being consulted.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
This introduces an experiment where we can stop looking up ngrams at a
certain limit. The insight here is that for large substrings we spend
more time finding the smallest ngram frequency than the time a normal
search takes. So instead we can try and find a good balance between
looking for a good (two) ngrams and actually searching the corpus.
The plan is to set different values for
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in sourcegraph production and
see how it affects performance of attribution search service.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
Updates grafana/regexp to the latest version of the speedup branch. This
incorporates the changes from upstream go that have happened.
Test Plan: go test ./...
Add fallbacks for languages not supported yet by linguist or go-enry
So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.
Here we calulate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.
Notes:
Also fixed an off-by-1 bug with how we count documents.
Test plan:
- Updated unit test
- Context evaluation results are slightly worse with a decrease from 64/89 to 63/89
* Update VSCode launch.json to enable easier debugging
* Add index path
As described, we'd like this information to be externally available to
drive metrics for incremental indexing.
Closes #780.
When we introduced filename boosting in BM25, we set it to a very conservative
weight. This PR increases the weight from 2.0 -> 5.0, which improves results on
relevant evals.
Relates to SPLF-88
Follow up to #779. This PR removes the logic for trigrams with the same
frequency, because it will no longer have a big effect.
We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.
It's confusing to call this `UseKeywordScoring`, since we do not use it for the
`keyword` patterntype in Sourcegraph. This commit clarifies the name to mention
BM25.
The current configuration creates a conflict with default
fetch config `+refs/heads/*:refs/heads/*`.
The first fetch is ok but once a local "refs/heads/meta/config" exists
fetch fails because 2 fetch rules match for local "refs/heads/meta/config" :
"refs/meta/config" and "refs/heads/meta/config"
Changing local ref name fixes the problem
This is so that when authentication details change we correctly update
the CloneURL for repositories managed by our mirror commands.
While working on ranking, I noticed that sum-tf is wrong if we have filename and content matches.
We use `finalCands` in our BM25 scoring, however, `finalCands` is modified in `fillChunkMatches` and `fillMatches` which can lead to surprising scores.
Test plan:
updated unit test
* Honor regex flags for case-sensitivity
* change
* add test
* pr comment
The repo is indexed only with the reponame which is more user-friendly
in the search when there is onely one host
Filter string will be "r:my-repo" instead of "r:my-gerrit-sever.com/my-repo"
This allows indexing repos that are used for development and therefore have an
SSH 'origin' URL.
After this update "trivy fs go.mod" has a clean report.
go get golang.org/x/net@v0.23.0 google.golang.org/protobuf@v1.33.0
Test Plan: go test
The images produced are no longer consumed directly by sourcegraph so we
don't have security scanners running on them. Additionally we upgraded
to the latest alpine now. Both combined means we can simplify the apk
add lines.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
gitindex: include environ for git tests
A recent change introduced GIT_CONFIG_GLOBAL and GIT_CONFIG_SYSTEM, but
ran into the footgun of unsetting the rest of the environ when cmd.Env
is non-empty. This for example broke tests for me since we didn't have
my custom $PATH set.
Instead we revert the change and make a much smaller change to just
always set the relevant environment variables.
This reverts commit ab1b8f09199ea7a731fd80876fc5b0f376ff6882.
Test Plan: go test
We now only consume zoekt via gRPC at Sourcegraph and I doubt anyone
uses the old endpoints.
This will have one required update in sourcegraph, and that is to use
SenderFunc from the main zoekt package rather than from the now deleted
stream package.
This PR adds adds a debugging flag to periodically check memory usage against a
threshold. If it exceeds the threshold, then a memory profile like
`indexmemory.prof.1` is written to disk. No more than 10 profiles will be
written.
I've already found this more useful than the existing `-memprofile` flag, so I
removed that. It's hard to get insights using that flag, since it only takes a
single profile per shard, forces GC, and forces parallelism to 1.
This is a much more robust detection mechanism. Additionally we have
these signals we can also add in:
func IsConfiguration(path string) bool
func IsDocumentation(path string) bool
func IsDotFile(path string) bool
func IsImage(path string) bool
My main concern with this change is generated file detection on content
using up RAM or CPU. Will monitor this impact on pprof in production.
Test Plan: go test.
Looking at heap profiles, the `ReadMetadata` function creates a ton of garbage
objects. The main contributor is in other sections from the TOC, specifically
decoding `compoundSection.offsets` . However, to read metadata, we only really
need to parse the metadata sections.
This PR introduces a `skip` method that skips over a section without reading
it. This greatly reduces the allocations from `ReadMetadata`.
Our dashboards currently report indexing delay using the
`index_queue_age_seconds` metric, which tracks the duration indexing jobs wait
in the queue before being "popped" for indexing. For dot com, we regularly see
the median for this metric at 2 days. However, this metric may be misleading
because it includes "noop" indexing jobs where the index is already up to date.
These jobs can stay in the queue for a long time since they are not high
priority.
This PR adds a new metric `index_indexing_delay_seconds`, which reports the
duration from when an indexing job enters the queue, to when it completes. It
uses 'state' as a label, so by setting `state='success'` we can (roughly) see
the delay from when Zoekt learns about a new version, to when that version is
available to search.
This is an optional dependency that I don't believe OSS users of docker
use. We used to use these dockerfiles in Sourcegraph, but have since
moved to building it in our own CI with scip-ctags.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
Any write to the index dir triggered a scan. This means on busy
instances we are constantly rescanning, leading to an
over-representation in CPU profiles around watch. The events are
normally writes to our temporary files. By only considering events for
.zoekt files (which is what scan reads) we can avoid the constant scan
calls.
Just in case we also introduce a re-scan every minute in case we miss an
event. There is error handling around this, but I thought it is just
more reliable to call scan every once in a while.
Note: this doesn't represent significant CPU use, but it does muddy the
CPU profiler output. So this makes it easier to understand trends in our
continuous cpu profiling.
Test Plan: CI
This enables shard merging by default for zoekt-sourcegraph-indexserver.
Sourcegraph has been using shard merging in production for several years. We have recently confirmed significant performance improvements for queries which are bound by matchTree construction.
I also remove -merge_max_priority because we have stopped using it.
Use SRC_DISABLE_SHARD_MERGING to disable shard merging.
Test plan:
mostly CI, I did some manual testing to confirm that shard merging is enabled by default for zoekt-sourcegraph-indexserver.
The first bit of data I am getting back indicates this strategy of
limiting the number of ngrams we lookup isn't working. I am still
experimenting with different limits, but in the meantime it is easy to
implement a strategy which picks a random subset. This is so that the
first N ngrams of a query aren't the only ones being consulted.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
This introduces an experiment where we can stop looking up ngrams at a
certain limit. The insight here is that for large substrings we spend
more time finding the smallest ngram frequency than the time a normal
search takes. So instead we can try and find a good balance between
looking for a good (two) ngrams and actually searching the corpus.
The plan is to set different values for
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT in sourcegraph production and
see how it affects performance of attribution search service.
Test Plan: ran all tests with the envvar set to 2. I expected tests that
assert on stats to fail, but everything else to pass. This was the case.
SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
So far, we didn't include IDF in our BM25 score function. Zoekt uses a
trigram index and hence doesn't compute document frequency during
indexing. We could add this information to the index, but it is not
immediately obvious how to tokenize code in a way that is compatible
with tokens from a natural language query.
Here we calulate the document frequency at query time under the
assumption that we visit all documents containing any of the query terms.
Notes:
Also fixed an off-by-1 bug with how we count documents.
Test plan:
- Updated unit test
- Context evaluation results are slightly worse with a decrease from 64/89 to 63/89
We select the two least frequent trigrams to create the candidate match
iterator. It's common for these trigrams to overlap. This change shifts the
first and last trigrams to avoid overlap, which is guaranteed to result in a
smaller intersection. For frequent terms, this can substantially reduce the
number of candidate matches we consider.
The current configuration creates a conflict with default
fetch config `+refs/heads/*:refs/heads/*`.
The first fetch is ok but once a local "refs/heads/meta/config" exists
fetch fails because 2 fetch rules match for local "refs/heads/meta/config" :
"refs/meta/config" and "refs/heads/meta/config"
Changing local ref name fixes the problem
The images produced are no longer consumed directly by sourcegraph so we
don't have security scanners running on them. Additionally we upgraded
to the latest alpine now. Both combined means we can simplify the apk
add lines.
Test Plan: built locally
docker build -t zoekt .
docker build -t zoekt-indexserver . -f Dockerfile.indexserver
docker build -t zoekt-webserver . -f Dockerfile.webserver
gitindex: include environ for git tests
A recent change introduced GIT_CONFIG_GLOBAL and GIT_CONFIG_SYSTEM, but
ran into the footgun of unsetting the rest of the environ when cmd.Env
is non-empty. This for example broke tests for me since we didn't have
my custom $PATH set.
Instead we revert the change and make a much smaller change to just
always set the relevant environment variables.
This reverts commit ab1b8f09199ea7a731fd80876fc5b0f376ff6882.
Test Plan: go test