Commits
The links were absolute and therefore broken. If you just run Zoekt by
itself that's fine, but in conjunction with Sourcegraph the links don't
work.
We have been running with eager runDocSection decoding on sourcegraph.com this
week and have seen improvements to tail latency and average latency. We
believe the lazy decoding was unnecessary since we so often do global symbol
searches that we would constantly be decoding doc section data all the time.
Co-authored-by: Sourcegraph <batch-changes@sourcegraph.com>
Full debug context here: https://sourcegraph.slack.com/archives/C023ELQLV7F/p1672835804462349
Commit 9899a9b3f475ef066ed70c395f8b303268f5d00c changed what the
response looks like when Zoekt has never loaded a shard: it now reports
a `Crashes = 1`, even if everything's fine.
That leads to upstream errors where we show an error message in the
Sourcegraph admin UI because the customer hasn't added any repositories
to their instance yet.
The fix here changes the `loader` to also mark the `shardedSearcher` as
ready if there was nothing to load. Previously the `markReady` was
skipped, the searcher wasn't marked as "ready" and every search query
was replied to with a `Crashes = 1`
The repos are now hidden by default. If visible they are rendered as
html table instead of buttons inside a form.
Currently taking up 90% of object allocations for sg. This commit
reduces that by 90%. The next step we can take is instead of
returning a map we use a slice.
name old time/op new time/op delta
RepoList_Encode-8 166µs ± 4% 84µs ± 3% -49.12% (p=0.000 n=10+10)
RepoList_Decode-8 519µs ± 2% 148µs ± 1% -71.46% (p=0.000 n=10+10)
name old bytes new bytes delta
RepoList_Encode-8 57.8kB ± 0% 51.1kB ± 0% -11.66% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
RepoList_Encode-8 4.06kB ± 0% 57.34kB ± 0% +1311.02% (p=0.000 n=10+10)
RepoList_Decode-8 316kB ± 0% 259kB ± 0% -17.96% (p=0.000 n=10+7)
name old allocs/op new allocs/op delta
RepoList_Encode-8 1.00k ± 0% 0.00k ± 0% -99.90% (p=0.000 n=10+10)
RepoList_Decode-8 6.59k ± 0% 0.59k ± 0% -91.09% (p=0.000 n=10+10)
Test Plan: go test
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
Co-authored-by: Camden Cheek <camden@ccheek.com>
We use two different versions, charset=utf-8 and charset=utf8, in the
code. The former seems to be the standard, hence we rename the latter.
We serve the hostname so that we can associate a repo with its
indexserver on the repo status page in Sourcegraph.
Now, reindex requests are handled asynchronously, which means the client
won't have to wait around for big repos to be indexed. This makes things
easier to handle once we call this endpoint from Sourcegraph.
Instead of failing hard, we simply log an move on. So far this
connection is only used for a nonessential feature ("reindex button").
A pattern within search.largeFiles can negate files which match an earlier pattern, which would exclude that file from being able to ignore/override the size max threshold.
* log: different msg depending on ctx error
* use errors.Is to check ctx errors
* lower log severity to warn
This is the final commit implementing a non-zero Crashes value while
shardedSearcher is startin up. This is meant to communicate when a zoekt
is partially available. When landed, Sourcegraph's APIs will communicate
that the user may have partial results and should retry.
Test Plan: go test. Additionally added a large sleep to loader.load to
simulate a slow startup. Then did searches in zoekt-webserver and
observed it returning non-zero crash counts until all shards loaded.
If this field is false we will communicate back to the client that we
are partially available. We piggy-back on the Crashes stat since that is
currently used by Sourcegraph to communicate back partial availability.
However, once we stabilize here we should introduce more fine grained
fields.
This commit just introduces reads to the field, the next commit will
introduce the logic which mutates this field which will actually change
behaviour.
Test Plan: go test
Fixes a regression around dev logging.
I want to introduce some extra state were the order we access it is
important, so we introduce a new structure loaded which represents the
final computed state. For now thiis is a non-functional change, we only
introduce the structure and update call sites. The next commit will add
a field.
Test Plan: go test
Want the SRC_LOG_SCOPE_LEVEL overrides.
Test Plan: go test ./...
This adds the option to trigger a reindex without rendering "/", which
is going to be useful if we want to trigger a reindex from Sourcegraph.
This enables the proxy for Sourcegraph's webserver
See #487
forceIndex should set the lock on index dir just like we do during
normal index operations.
With this change, webserver proxies request to `<webserver>/indexserver/` to the unix domain socket at `<index dir>/indexserver.socket`. Since this is a Sourcegraph feature, we put the proxy handler in webserver behind a flag which is toggled off per default.
The motivation is to expose indexserver endpoints on webserver addresses, which in case of Sourcegraph, are easily accessible at runtime.
Also needed to run "go get github.com/grafana/regexp@speedup" to ensure
we are getting the speedup.
Test Plan: go test ./...
Sourcegraph's new hybrid searcher sends around large ORs of filenames to
search for. FileNameSet provides an efficient implementation to
transport and query these sets of filenames.
We include an efficient gob implementation based on previous work around
encoding a more complicated map of strings.
Test Plan: Added unit tests. Additionally, updated sourcegraph to use
this implementation and ran its tests.
the field lets us document why the collectSender flushed. This is
useful for debugging.
This fixes a bug where if s.FileName and s.Content were TRUE, we didn't
expand and the resulting matchTree was implicitly matching filenames
only.
Now we match content and filenames if s.Filename == s.Content.
Previously it was a hassle to set the debug flag for every search while
debugging. Now we keep track of the state so we only have to set it once.
The current code sets the repo rank of a `rankedShard`. However we don't read `rankedShard.repos` during evaluation, so setting the repo rank had no effect.
Now we set repo rank when we load the shard.
As a consequence, file matches receive a boost (shard-order) based on their shard's priority.
The effective contribution ranges from `[0, scoresShardRankFactor(=20)]`.
Type aliases are super common in Go. From experimentally searching
around I often failed to find something since it was a type alias to a
map. EG when searching Header in the stdlib. So this change makes them
equivalent.
Test Plan: go test
We tune the contribution of document ranks to the final ranking with a damping factor. The zero value of the damping factor leads to an equal contribution of scores and ranks, while a damping factor of 1 effectively eliminates the influence of document ranks.
The default is 0 IE equal contribution.
* web: quote language query param in link
The language query parameter must be quoted because the
language names may contain whitespaces.
This fixes invalid or empty search results when clicking on a language label
in the search result page when the language contains whitespaces.
* web: don't show matches with line number < 1
Ignore matches with an invalid line number.
E.g when doing a simple file search with `f:\.c$`
a match with line number 0 is returned for each file.
This match is ignored, which also slims down the result list.
This header will be enforced from Sourcegraph in the future.
Test Plan: CI
We add a few things in this commit. Namely we add generic ranking which
is based on best guesses on what is important across languages. I
believe this is a useful change since it is mostly true and is likely
better than just treating all symbols as equal.
Secondly I add language specific ranking for python, ruby and php. These
are as usualy best guesses based on experience in the language.
Lastly our test cases for TestReadSearch had the language stored in
lower case. I noticed this thanks to the generic scorer running since
previously the go one was not. Just to be safe I made the switch
statements match against both the expected casing and lowered case of
languages.
Test Plan: go test
With this change we recognize changes in capitalization as word boundaries.
There is a corner case which was caught in the tests, where we falsy
classify "nnerClass" as word, because the preceeding letter in the code
is a capital "I". However, I don't believe this corner case is relevant
in practive and by accepting it we keep the code simple.
This moves the "score" prefix from the template to the debug strings.
This way we have a consistent behaviour in Sourcegraph and Zoekt.
Pure rename, no logic changes. I consistently misspelled the acronym
for Reciprocal Rank Fusion as RFF instead of RRF.
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
We want pipefail behavior and in practice this seems to work in busybox
sh.
We don't need to set this data if we are not using it for ranking.
A couple of test searches revelead that interfaces should rank higher
most of the time. With this change we move interfaces to the top.
If we have a MaxDocDisplayCount set, then there is no need sending all
matches found in a shard when at most MaxDocDisplayCount will end up
being in the final result set.
This should improve ranking and let us do more work per shard. I am
slightly worried about negative interactions with TotalMaxMatchCount so
I am feature flagging this behaviour behind UseDocumentRanks.
Test Plan: unit tested without the feature flag.
The install-ctags-alpine.sh silently fails when wget fails to download the
ctags-install-alpine.sh script. The resulting container image is then missing ctags support.
Update the ctags build script to return non-zero on failure.
The root cause is that the busybox wget implementation can not connect to https behind a http proxy.
See also https://gitlab.alpinelinux.org/alpine/aports/-/issues/10446
Install GNU wget to resolve this issue.
In addition to the command flag argument we also have the option to set backoff duration and maximum backoff duration using environment variables.
The backoff logic works fine without this but we can introduce the option of configuring deployments with environment variable.
Test plan: Go test
This follows the examples of other languages like Java, Kotlin and Go.
Introduce backoff strategy for failed indexing attempts. The strategy's goal is to backoff subsequent indexing attempts for a given repository for a period of time because we've identified that the preceding indexing attempt has failed. The backoff period of time is finite and increases linearly with consecutive indexing failures.
Test Plan: Add queue tests and backoff duration tests
This field allows us to indicate when the ranks have changed, which
forces us to re-index. We introduce DocumentRankVersion as a builder
options such that it can influence that behavior. By making a
corresponding change to Sourcegraph to set this field we will
automatically update indexes as document ranks change for a repository.
Note: We remove the document ranking feature flag from index server.
This is now completely controlled by frontend.
Note: I got started on a larger refactor here to make this more clean.
The idea is to make DocumentRanksPath a URL which is controlled by
Sourcegraph. We can then use the URL changing to indicate re-indexing
for example. For now this is a much smaller change which will unblock us
sooner.
Test Plan: Ran a local server and tested that when the ranks are missing
nothing happens. When the ranks change or are added we reindex.
Additionally when the ranks do not change it does nothing.
go install ./cmd/zoekt-git-index
go run ./cmd/zoekt-sourcegraph-indexserver \
-sourcegraph_url ~/src/github.com/sourcegraph/ \
-listen 127.0.0.1:6072
# We expect this to trigger an index
echo -e "README.md\t0.1" > ~/src/github.com/sourcegraph/zoekt/SG_DOCUMENT_RANKS
pkill -SIGUSR1 zoekt-sourcegra
# We expect this to do nothing since the ranks have not changed
pkill -SIGUSR1 zoekt-sourcegra
# We expect this to trigger a re-index
echo -e "README.md\t0.2" > ~/src/github.com/sourcegraph/zoekt/SG_DOCUMENT_RANKS
pkill -SIGUSR1 zoekt-sourcegra
In the original zoekt code base each repository has a uint16 called Rank
which is computed from repository signals like star count.
zoekt-sourcegraph-indexserver did not set those signals, instead we
created our own field which is a float64 called priority. In practice
this value is just the star count.
This commit detects empty rank fields and sets them using the same
calculation based on priority. This will make Zoekt's ranking now use
star count as a signal.
Note: when we are ready, we can use more advanced signals as the
input (and code-intel is working on that).
Note: this won't affect current ranking on Sourcegraph since we bucketed
by star count => repos in the same bucket will have the same rank.
Test Plan: Ran zoekt-webserver with an extra log line and observed it
logging Rank values. Additionally rank some searches with "debug=1" set
in the URL params and observed influences on the computed scores.
log.Println("RANK", repo.Name, repo.Rank)
go run ./cmd/zoekt-webserver -listen 127.0.0.1:6070
Heuristics to improve ranking on go files. Weights are based on the
pre-existing heuristics we have for Kotlin and Java.
Test Plan: updated test + experimentation with "debug=1" on
zoekt-webserver.
This option makes it so that streaming search will collect results for
FlushWallTime before sending. This allows a client to provide
best-effort ranking based on time. We hope this will be a sweet spot
between the benefits of streaming latency and relevance of results.
Additionally we also make it so that streaming search respects
MaxDocDisplayCount.
Test Plan: go test
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This upgrades the golang.org/x/text module to mitigate CVE-2022-32149.
The vulnerability was reported by the Trivy scanner.
See also
https://nvd.nist.gov/vuln/detail/CVE-2022-32149
https://github.com/golang/go/issues/56152
We have been running with eager runDocSection decoding on sourcegraph.com this
week and have seen improvements to tail latency and average latency. We
believe the lazy decoding was unnecessary since we so often do global symbol
searches that we would constantly be decoding doc section data all the time.
Full debug context here: https://sourcegraph.slack.com/archives/C023ELQLV7F/p1672835804462349
Commit 9899a9b3f475ef066ed70c395f8b303268f5d00c changed what the
response looks like when Zoekt has never loaded a shard: it now reports
a `Crashes = 1`, even if everything's fine.
That leads to upstream errors where we show an error message in the
Sourcegraph admin UI because the customer hasn't added any repositories
to their instance yet.
The fix here changes the `loader` to also mark the `shardedSearcher` as
ready if there was nothing to load. Previously the `markReady` was
skipped, the searcher wasn't marked as "ready" and every search query
was replied to with a `Crashes = 1`
Currently taking up 90% of object allocations for sg. This commit
reduces that by 90%. The next step we can take is instead of
returning a map we use a slice.
name old time/op new time/op delta
RepoList_Encode-8 166µs ± 4% 84µs ± 3% -49.12% (p=0.000 n=10+10)
RepoList_Decode-8 519µs ± 2% 148µs ± 1% -71.46% (p=0.000 n=10+10)
name old bytes new bytes delta
RepoList_Encode-8 57.8kB ± 0% 51.1kB ± 0% -11.66% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
RepoList_Encode-8 4.06kB ± 0% 57.34kB ± 0% +1311.02% (p=0.000 n=10+10)
RepoList_Decode-8 316kB ± 0% 259kB ± 0% -17.96% (p=0.000 n=10+7)
name old allocs/op new allocs/op delta
RepoList_Encode-8 1.00k ± 0% 0.00k ± 0% -99.90% (p=0.000 n=10+10)
RepoList_Decode-8 6.59k ± 0% 0.59k ± 0% -91.09% (p=0.000 n=10+10)
Test Plan: go test
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
Co-authored-by: Camden Cheek <camden@ccheek.com>
This is the final commit implementing a non-zero Crashes value while
shardedSearcher is startin up. This is meant to communicate when a zoekt
is partially available. When landed, Sourcegraph's APIs will communicate
that the user may have partial results and should retry.
Test Plan: go test. Additionally added a large sleep to loader.load to
simulate a slow startup. Then did searches in zoekt-webserver and
observed it returning non-zero crash counts until all shards loaded.
If this field is false we will communicate back to the client that we
are partially available. We piggy-back on the Crashes stat since that is
currently used by Sourcegraph to communicate back partial availability.
However, once we stabilize here we should introduce more fine grained
fields.
This commit just introduces reads to the field, the next commit will
introduce the logic which mutates this field which will actually change
behaviour.
Test Plan: go test
I want to introduce some extra state were the order we access it is
important, so we introduce a new structure loaded which represents the
final computed state. For now thiis is a non-functional change, we only
introduce the structure and update call sites. The next commit will add
a field.
Test Plan: go test
With this change, webserver proxies request to `<webserver>/indexserver/` to the unix domain socket at `<index dir>/indexserver.socket`. Since this is a Sourcegraph feature, we put the proxy handler in webserver behind a flag which is toggled off per default.
The motivation is to expose indexserver endpoints on webserver addresses, which in case of Sourcegraph, are easily accessible at runtime.
Sourcegraph's new hybrid searcher sends around large ORs of filenames to
search for. FileNameSet provides an efficient implementation to
transport and query these sets of filenames.
We include an efficient gob implementation based on previous work around
encoding a more complicated map of strings.
Test Plan: Added unit tests. Additionally, updated sourcegraph to use
this implementation and ran its tests.
The current code sets the repo rank of a `rankedShard`. However we don't read `rankedShard.repos` during evaluation, so setting the repo rank had no effect.
Now we set repo rank when we load the shard.
As a consequence, file matches receive a boost (shard-order) based on their shard's priority.
The effective contribution ranges from `[0, scoresShardRankFactor(=20)]`.
* web: quote language query param in link
The language query parameter must be quoted because the
language names may contain whitespaces.
This fixes invalid or empty search results when clicking on a language label
in the search result page when the language contains whitespaces.
* web: don't show matches with line number < 1
Ignore matches with an invalid line number.
E.g when doing a simple file search with `f:\.c$`
a match with line number 0 is returned for each file.
This match is ignored, which also slims down the result list.
We add a few things in this commit. Namely we add generic ranking which
is based on best guesses on what is important across languages. I
believe this is a useful change since it is mostly true and is likely
better than just treating all symbols as equal.
Secondly I add language specific ranking for python, ruby and php. These
are as usualy best guesses based on experience in the language.
Lastly our test cases for TestReadSearch had the language stored in
lower case. I noticed this thanks to the generic scorer running since
previously the go one was not. Just to be safe I made the switch
statements match against both the expected casing and lowered case of
languages.
Test Plan: go test
With this change we recognize changes in capitalization as word boundaries.
There is a corner case which was caught in the tests, where we falsy
classify "nnerClass" as word, because the preceeding letter in the code
is a capital "I". However, I don't believe this corner case is relevant
in practive and by accepting it we keep the code simple.
If we have a MaxDocDisplayCount set, then there is no need sending all
matches found in a shard when at most MaxDocDisplayCount will end up
being in the final result set.
This should improve ranking and let us do more work per shard. I am
slightly worried about negative interactions with TotalMaxMatchCount so
I am feature flagging this behaviour behind UseDocumentRanks.
Test Plan: unit tested without the feature flag.
The install-ctags-alpine.sh silently fails when wget fails to download the
ctags-install-alpine.sh script. The resulting container image is then missing ctags support.
Update the ctags build script to return non-zero on failure.
The root cause is that the busybox wget implementation can not connect to https behind a http proxy.
See also https://gitlab.alpinelinux.org/alpine/aports/-/issues/10446
Install GNU wget to resolve this issue.
Introduce backoff strategy for failed indexing attempts. The strategy's goal is to backoff subsequent indexing attempts for a given repository for a period of time because we've identified that the preceding indexing attempt has failed. The backoff period of time is finite and increases linearly with consecutive indexing failures.
Test Plan: Add queue tests and backoff duration tests
This field allows us to indicate when the ranks have changed, which
forces us to re-index. We introduce DocumentRankVersion as a builder
options such that it can influence that behavior. By making a
corresponding change to Sourcegraph to set this field we will
automatically update indexes as document ranks change for a repository.
Note: We remove the document ranking feature flag from index server.
This is now completely controlled by frontend.
Note: I got started on a larger refactor here to make this more clean.
The idea is to make DocumentRanksPath a URL which is controlled by
Sourcegraph. We can then use the URL changing to indicate re-indexing
for example. For now this is a much smaller change which will unblock us
sooner.
Test Plan: Ran a local server and tested that when the ranks are missing
nothing happens. When the ranks change or are added we reindex.
Additionally when the ranks do not change it does nothing.
go install ./cmd/zoekt-git-index
go run ./cmd/zoekt-sourcegraph-indexserver \
-sourcegraph_url ~/src/github.com/sourcegraph/ \
-listen 127.0.0.1:6072
# We expect this to trigger an index
echo -e "README.md\t0.1" > ~/src/github.com/sourcegraph/zoekt/SG_DOCUMENT_RANKS
pkill -SIGUSR1 zoekt-sourcegra
# We expect this to do nothing since the ranks have not changed
pkill -SIGUSR1 zoekt-sourcegra
# We expect this to trigger a re-index
echo -e "README.md\t0.2" > ~/src/github.com/sourcegraph/zoekt/SG_DOCUMENT_RANKS
pkill -SIGUSR1 zoekt-sourcegra
In the original zoekt code base each repository has a uint16 called Rank
which is computed from repository signals like star count.
zoekt-sourcegraph-indexserver did not set those signals, instead we
created our own field which is a float64 called priority. In practice
this value is just the star count.
This commit detects empty rank fields and sets them using the same
calculation based on priority. This will make Zoekt's ranking now use
star count as a signal.
Note: when we are ready, we can use more advanced signals as the
input (and code-intel is working on that).
Note: this won't affect current ranking on Sourcegraph since we bucketed
by star count => repos in the same bucket will have the same rank.
Test Plan: Ran zoekt-webserver with an extra log line and observed it
logging Rank values. Additionally rank some searches with "debug=1" set
in the URL params and observed influences on the computed scores.
log.Println("RANK", repo.Name, repo.Rank)
go run ./cmd/zoekt-webserver -listen 127.0.0.1:6070
This option makes it so that streaming search will collect results for
FlushWallTime before sending. This allows a client to provide
best-effort ranking based on time. We hope this will be a sweet spot
between the benefits of streaming latency and relevance of results.
Additionally we also make it so that streaming search respects
MaxDocDisplayCount.
Test Plan: go test
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>