Commits
This fixes an annoying problem where after running `go test ./...` locally,
your local git `user.name` would be set to `thomas`.
This adds the before and after Lines in the search results. The default is
still 0 so the user has to increase "context lines" to see more.
Somehow I just found out about `gen-proto.sh` recently, after many months of
working on Zoekt! This PR adds brief `README` files in the proto directories to
guide people (and LLMs) to it.
Small refactor to simplify the logic in `trace.New`. It inlines a couple public
methods that don't need to be exposed separately.
This PR re-enables the go-git optimization that was disabled in https://github.com/sourcegraph/zoekt/pull/870 because it caused a huge bug. This PR removes the part of the optimization that was causing the bug, setting `LargeObjectThreshold`.
An alternative would be to track down the root cause of the bug in go-git, so that we could keep setting `LargeObjectThreshold`. However, I don't feel that's worth it as:
* The memory improvement from this change was pretty small (see https://github.com/sourcegraph/zoekt/pull/854)
* Eventually we want to move away from go-git towards a "vanilla" direct git usage
Make the flag "token" optional. Exit if the user provided token does not
exist (as before). Only, fallback to the token from the default location
if the user did not provide a token. Continue with an unauthenticated
GitHub client (`nil` OAuth client) if the fallback did not work (does
not exist, cannot be read, ...).
This PR fixes a bug where Zoekt would not compile on 32-bit architectures. It
also takes the opportunity to start using the `math` library everywhere instead
of our own constants like `maxUInt32` to help prevent this sort of issue in the
future by encouraging devs to select the most accurate "max" type for their
specific situation.
Closes https://github.com/sourcegraph/zoekt/issues/935
Closes SPLF-913
This only affects Sourcegraph.
This adds Index and Delete methods to
sourcegraph-indexserver's GRPC server.
The methods aren't called yet by Sourcegraph.
Notable differences to our current indexing loop:
- IndexOptions are pushed by the client instead of pulled by Zoekt
- A semaphore restricts concurrency
- New ENV `SRC_STOP_INDEXING` stops the "competing" indexing loop. This is just for
testing and will eventually be removed
- The Index method tries to recover a possibly older index from
`indexDir/.trash` first before starting to index
Review:
I recommend reviewing the `.proto` file first, followed by `main.go`
Test plan:
- I did a bunch of manual testing to test the trash recovery and the
semaphore logic. I successfuly triggered index and delete calls to the
GRPC server. I will put the Insomnia collection I used for testing in the ticket.
- New unit tests
In https://github.com/sourcegraph/zoekt/pull/901 I moved several packages to 'internal' to clean up the exported API. This PR moves the `shards` package back to root, since it contains important methods like `NewDirectorySearcher`. It also renames the `shards` package to `search` to clarify the usage.
Relates to https://github.com/sourcegraph/zoekt/pull/901#issuecomment-2703171432, https://github.com/sourcegraph/zoekt/discussions/927
The gatherBranches() function shifts a mask of type uint64 until it is 0, but
uses an id for accessing a map of type uint32. This results in empty branch
names getting returned from this function. It can be fixed by changing the type
of id to also be uint64.
The latest release of gopls has a feature called modernize which will
update your code where it can to use modern go features/pkgs.
https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0
Generated with:
go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./...
Test Plan: CI
In testing, I noticed another problem with BM25: sometimes a binary file is ranked highly because of a match on its filename. In classic Zoekt scoring, these are ranked low because they are skipped, and we always sort skipped docs to the end of the index.
This PR ensures they're also ranked low for BM25 by adding a 'binary' category, and marking it low priority. Adding this category required updating `SkipReason` to track the reason a document was skipped. This is necessary because we set the content of skipped docs to `nil`, and `SkipReason` is the only lasting indication that it was binary.
Relates to #920
Relates to SPLF-874
This implements DeleteAllData. We hold the global lock while deleting all simple shards belonging to a tenant. We also handle compound shards by disassembling them first.
Note that this "only" deletes persisted data. Updating the queue, for example, seems fragile because it might immediately get updated by Sourcegraph. This implies that Sourcegraph first has to delete the tenant in the Sourcegraph DB first and then call this new endpoint. Even if the queue still has a reference to a deleted tenant, indexserver won't be able to retrieve index options or clone the repo from gitserver.
Test plan:
- new unit tests
- manual testing:
I ran this together with Sourcegraph and triggered a delete by calling DeleteAllData directly. I confirmed that all shards, including compound shards are deleted.
This PR reworks the way we incorporate file signals into BM25. Previously, we were applying them as a tie-breaker. But in dogfooding, we found that these rarely impact results, because it's so rare to have a tie in BM25 scores.
Now, we take the file signal into account when computing BM25F. The interpretation is that this data lives in a separate 'field' that is half the priority of regular content.
Relates to SPLF-874
This adds a grpc server to sourcegraph-indexserver. For now it supports just one method.
The diff is quite big, so I left comments to mark the most important bits.
I used the opportunity to clean up a bit (=> hence the big diff):
- Reuse grpc logic from webserver and move those bits to "/gprc/..."
- Move "protos" inside the new "grpc" directory (=> requires changes of import statements in Sourcegraph)
- Refactor import aliases for grpc packages across the codebase
Test plan:
I tested this locally by calling the new grpc endpoint directly.
This was left over from our PageRank prototype.
With this change we recognize boosted queries in our bm25 scoring and
adjust the overall score accordingly.
We need to take care of 2 parts: The overall bm25 score of the document,
and the line score determining the order in which we return the chunks.
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase.
The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind.
It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers.
Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards.
Added a benchmark to check the impact. See comments in the code.
---------
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This adds repo freshness and file order as tiebreakers to the final
bm25 score, just like we have for Zoekt's default scoring.
During testing I found that it is a lot less likely for the tiebreakers
to have an effect with BM25 because the score depends on qualites of the
document, such as the relative length and number of matches, which
usually differ even with the quality of the match is similar.
Test plan:
- Score tests still pass
- manual testing: see screenshots
We remove IDF from our BM25 scoring, effectively treating it as constant.
This is supported by our evaluations which showed that for keyword style queries, IDF can down-weight the score of important keywords too much, leading to a worse ranking. The intuition is that for code search, each keyword is important independently of how frequent it appears in the corpus.
Removing IDF allows us to apply BM25 scoring to a wider range of query types. Previously, BM25 was limited to queries with individual terms combined using OR, as IDF was calculated on the fly at query time.
Test plan:
updated tests
Fixes a bug I introduced in #891.
When skipping a doc, we currently report the detected language as "binary" (if
it looks like binary) or "skipped" (if it's skipped for any other reason).
Skipped docs are still added to the index and can still be returned as search
results, for example if you only match on filename. So sometimes file matches
are returned with "skipped" as their language, even though the file path is
clearly some other language like XML.
This PR updates the indexing logic to still detect the language even if the
document is skipped. However, we avoid passing the contents to the language
detection library to avoid running detection on huge files.
---------
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
We added support for Windows in https://github.com/sourcegraph/zoekt/pull/535. We then partially reverted the change, specifically the parts related to mmap (#706). This was okay because we no longer build for Windows.
This PR fully removes support to avoid being in a partially-implemented state.
When navigating the code, I've often forgotten the difference between
`NewBuilder` and `NewIndexBuilder`. This rename clarifies that one of these
indexes a whole repo, while the other builds individual shards. Also
`index.NewShardBuilder` sounds better.
Follow up to #905. I realized the README looked much better with a title!
This PR updates the README to clarify Zoekt's current design, and explain the main usage patterns.
When adding a `format=raw` to a request `/print`, we receive the raw Content data file in response
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
This PR adds doc comments for all packages/ commands.
In the repo root, we have a bunch of low level logic around index building and
searching. So we end up exposing internal logic through the main public `zoekt`
package, for example `zoekt.Merge(...)`.
This PR moves it into the `build` package, so all code related to index
building lives together. It then renames `build` to `index` to reflect the
broader focus on indexing and searching the index.
This PR moves the following packages to `internal` to avoid exposing them in the API:
* `ctags`
* `debugserver`
* `gitindex`
* `shards`
* `trace`
This removes the transitive dependency avo... which also removes the
global registration of the cpuprofile flag.
This is more a workaround since a transitive dependency has introduced a
global flag "cpuprofile", leading to a panic due to registring the flag
twice.
To make ourselves immune to this issue we can refactor our usages to use
a FlagSet, even for "main". This is a bigger and frankly inconvenient
change for a somewhat rare occurance. Instead we just rename our flag.
I feel comfortable renaming since this flag should only really be used
by Zoekt developers. There will be the issue that the flag will be shown
twice for commands, but I will report to the upstream repo about this
problem.
Test Plan: go get -u ./... && go run ./cmd/zoekt-git-index works
This fixes the failing 'fuzz test' check in CI. Example from https://github.com/sourcegraph/zoekt/pull/891:
```
Download action repository 'jidicula/go-fuzz-action@0206b61afc603b665297621fa5e691b1447a5e57' (SHA:0206b61afc603b665297621fa5e691b1447a5e57)
Getting action download info
Error: This request has been automatically failed because it uses a deprecated version of `actions/upload-artifact: v3`. Learn more: https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
```
Currently, we only record the number of compound shards. This PR adds a metric
for the total number of index shards. This metric is helpful for alerting, as
it's important to catch a significant decrease in the number of indexed data.
Follow up to https://github.com/sourcegraph/zoekt/pull/888, where I forgot to improve the test coverage.
Usually, if there are candidate matches with overlapping ranges, then we just remove matches that overlap. However, when `opts.ChunkMatches = false`, we had special logic to merge overlapping matches.
This PR removes the overlapping logic to simplify the behavior. I couldn't see a good reason to keep this special handling. Plus, we are moving towards making `ChunkMatches` the default.
Another benefit of this change is that it makes the BM25 behavior easier to understand. If we merged together ranges, then we would be calculating term frequencies for spurious terms (like `new`, `queue`, `newqueue`, `queuenew`, etc.) Note: we currently only use BM25 with `ChunkMatches = true`, so there's not an active bug here.
Currently, BM25 scoring only applies to the overall `FileMatch` score. The
algorithm gathered term frequencies from all candidate matches in the file to
produce a file-level score. However `LineMatch` and `ChunkMatch` scores were
still calculated using the classic Zoekt scoring algorithm.
This PR implements BM25 scoring for `LineMatch` and `ChunkMatch`. It does so by
calculating a BM25 per line. Compared to the classic Zoekt algorithm, this
rewards multiple term matches on a line. Because our term frequency calculation
also boosts symbol matches, the score smoothly balances between "many term
matches" and "interesting term matches".
Now, the code is structured as follows:
* `scoreChunk`: goes through each line in the chunk, calculating its score
through `scoreLine`, and returns the best-scoring line
* `scoreLine`: calculates the score for a single line
The mental model is that "the score of a chunk is always the score of its best
line".
This PR adds a new field `ChunkMatch.BestLineMatch` with the line number of top-scoring line in the chunk. This will let us address a long-standing issue with our new flexible keyword search, where chunk matches can become very large. Since our search results UX only shows the start of a chunk, the most relevant line may not even be visible. With this information on the best line match, we can adjust the search results UX to center the chunk on the most relevant line.
Relates to [SPLF-188](https://linear.app/sourcegraph/issue/SPLF-188/ensure-the-best-scoring-line-match-is-shown-in-search-results)
Add support to the GitLab mirror for excluding archived repos.
We had 2 copies of this logic in the code and this bit me when I
introduced id-based shards
indexbuilder.go seems an "ok" place to put. We cannot put it in the
shards package because of circular dependencies.
Test plan:
refactor, so relying on CI
Co-authored-by: Sourcegraph <batch-changes@sourcegraph.com>
When digging into our Natural Language Search (NLS) eval results, I found that one of the leading causes for flexible search types like "Fuzzy symbol search" and "Find logic" was noisy matches in top results. Currently, our BM25 ranking rewards any substring match equally. So for queries like 'extract tar', any match on 'tar' (even within unrelated terms like 'start', etc.) counts towards the term frequency.
This PR helps reduce noise by boosting symbol matches the same as we do filename matches. Our NLS evals show positive improvement, and context evals are the tiniest bit better.
This is an alternative to #875.
We run the health check with system priviledges. This way we run an
actual search, just like we do if tenant enforcement is off.
I also make sure we don't log system searches as "missing_tenant".
If we have lots of work done we start to truncate in net/trace. So
display information more succinctly.
Test Plan: CI doesn't complain should be good enough
I was inspecting logs in GKE and it incorrectly categorized the severity
of nearly all logs from zoekt-webserver and zoekt-indexserver. This is a
hack to make it work better without putting in the bigger work of
migrating us to structured logging.
Test Plan: go test
We have seen issues on large repos so lets default to off until we fix
again.
Before, we'd typically have two newlines, one coming from the corpus,
and one terminating the fmt.Printf call.
They are interpreted as AND, ie.
zoekt a b c
is equivalent to
zoekt "a b c"
The logging is neat for debugging, but is distracting when using
cmd/zoekt as a grep replacement.
It is hard to tell from net/trace how long we spend evaluating type:repo
sub-expressions. This adds in extra logging for when we do that, which
should help understand time taken to do this.
Test Plan: go test
This logging is useful since we directly call this search (and avoid
typeRepoSearcher) when evaluating List calls for type:repo searches. By
adding in this logging we can understand the performance of potentially
expensive queries here.
This PR re-enables the go-git optimization that was disabled in https://github.com/sourcegraph/zoekt/pull/870 because it caused a huge bug. This PR removes the part of the optimization that was causing the bug, setting `LargeObjectThreshold`.
An alternative would be to track down the root cause of the bug in go-git, so that we could keep setting `LargeObjectThreshold`. However, I don't feel that's worth it as:
* The memory improvement from this change was pretty small (see https://github.com/sourcegraph/zoekt/pull/854)
* Eventually we want to move away from go-git towards a "vanilla" direct git usage
Make the flag "token" optional. Exit if the user provided token does not
exist (as before). Only, fallback to the token from the default location
if the user did not provide a token. Continue with an unauthenticated
GitHub client (`nil` OAuth client) if the fallback did not work (does
not exist, cannot be read, ...).
This PR fixes a bug where Zoekt would not compile on 32-bit architectures. It
also takes the opportunity to start using the `math` library everywhere instead
of our own constants like `maxUInt32` to help prevent this sort of issue in the
future by encouraging devs to select the most accurate "max" type for their
specific situation.
Closes https://github.com/sourcegraph/zoekt/issues/935
Closes SPLF-913
This only affects Sourcegraph.
This adds Index and Delete methods to
sourcegraph-indexserver's GRPC server.
The methods aren't called yet by Sourcegraph.
Notable differences to our current indexing loop:
- IndexOptions are pushed by the client instead of pulled by Zoekt
- A semaphore restricts concurrency
- New ENV `SRC_STOP_INDEXING` stops the "competing" indexing loop. This is just for
testing and will eventually be removed
- The Index method tries to recover a possibly older index from
`indexDir/.trash` first before starting to index
Review:
I recommend reviewing the `.proto` file first, followed by `main.go`
Test plan:
- I did a bunch of manual testing to test the trash recovery and the
semaphore logic. I successfuly triggered index and delete calls to the
GRPC server. I will put the Insomnia collection I used for testing in the ticket.
- New unit tests
In https://github.com/sourcegraph/zoekt/pull/901 I moved several packages to 'internal' to clean up the exported API. This PR moves the `shards` package back to root, since it contains important methods like `NewDirectorySearcher`. It also renames the `shards` package to `search` to clarify the usage.
Relates to https://github.com/sourcegraph/zoekt/pull/901#issuecomment-2703171432, https://github.com/sourcegraph/zoekt/discussions/927
The latest release of gopls has a feature called modernize which will
update your code where it can to use modern go features/pkgs.
https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0
Generated with:
go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./...
Test Plan: CI
In testing, I noticed another problem with BM25: sometimes a binary file is ranked highly because of a match on its filename. In classic Zoekt scoring, these are ranked low because they are skipped, and we always sort skipped docs to the end of the index.
This PR ensures they're also ranked low for BM25 by adding a 'binary' category, and marking it low priority. Adding this category required updating `SkipReason` to track the reason a document was skipped. This is necessary because we set the content of skipped docs to `nil`, and `SkipReason` is the only lasting indication that it was binary.
Relates to #920
Relates to SPLF-874
This implements DeleteAllData. We hold the global lock while deleting all simple shards belonging to a tenant. We also handle compound shards by disassembling them first.
Note that this "only" deletes persisted data. Updating the queue, for example, seems fragile because it might immediately get updated by Sourcegraph. This implies that Sourcegraph first has to delete the tenant in the Sourcegraph DB first and then call this new endpoint. Even if the queue still has a reference to a deleted tenant, indexserver won't be able to retrieve index options or clone the repo from gitserver.
Test plan:
- new unit tests
- manual testing:
I ran this together with Sourcegraph and triggered a delete by calling DeleteAllData directly. I confirmed that all shards, including compound shards are deleted.
This PR reworks the way we incorporate file signals into BM25. Previously, we were applying them as a tie-breaker. But in dogfooding, we found that these rarely impact results, because it's so rare to have a tie in BM25 scores.
Now, we take the file signal into account when computing BM25F. The interpretation is that this data lives in a separate 'field' that is half the priority of regular content.
Relates to SPLF-874
This adds a grpc server to sourcegraph-indexserver. For now it supports just one method.
The diff is quite big, so I left comments to mark the most important bits.
I used the opportunity to clean up a bit (=> hence the big diff):
- Reuse grpc logic from webserver and move those bits to "/gprc/..."
- Move "protos" inside the new "grpc" directory (=> requires changes of import statements in Sourcegraph)
- Refactor import aliases for grpc packages across the codebase
Test plan:
I tested this locally by calling the new grpc endpoint directly.
With this change we recognize boosted queries in our bm25 scoring and
adjust the overall score accordingly.
We need to take care of 2 parts: The overall bm25 score of the document,
and the line score determining the order in which we return the chunks.
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase.
The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind.
It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers.
Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards.
Added a benchmark to check the impact. See comments in the code.
---------
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This adds repo freshness and file order as tiebreakers to the final
bm25 score, just like we have for Zoekt's default scoring.
During testing I found that it is a lot less likely for the tiebreakers
to have an effect with BM25 because the score depends on qualites of the
document, such as the relative length and number of matches, which
usually differ even with the quality of the match is similar.
Test plan:
- Score tests still pass
- manual testing: see screenshots
We remove IDF from our BM25 scoring, effectively treating it as constant.
This is supported by our evaluations which showed that for keyword style queries, IDF can down-weight the score of important keywords too much, leading to a worse ranking. The intuition is that for code search, each keyword is important independently of how frequent it appears in the corpus.
Removing IDF allows us to apply BM25 scoring to a wider range of query types. Previously, BM25 was limited to queries with individual terms combined using OR, as IDF was calculated on the fly at query time.
Test plan:
updated tests
When skipping a doc, we currently report the detected language as "binary" (if
it looks like binary) or "skipped" (if it's skipped for any other reason).
Skipped docs are still added to the index and can still be returned as search
results, for example if you only match on filename. So sometimes file matches
are returned with "skipped" as their language, even though the file path is
clearly some other language like XML.
This PR updates the indexing logic to still detect the language even if the
document is skipped. However, we avoid passing the contents to the language
detection library to avoid running detection on huge files.
---------
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
In the repo root, we have a bunch of low level logic around index building and
searching. So we end up exposing internal logic through the main public `zoekt`
package, for example `zoekt.Merge(...)`.
This PR moves it into the `build` package, so all code related to index
building lives together. It then renames `build` to `index` to reflect the
broader focus on indexing and searching the index.
This is more a workaround since a transitive dependency has introduced a
global flag "cpuprofile", leading to a panic due to registring the flag
twice.
To make ourselves immune to this issue we can refactor our usages to use
a FlagSet, even for "main". This is a bigger and frankly inconvenient
change for a somewhat rare occurance. Instead we just rename our flag.
I feel comfortable renaming since this flag should only really be used
by Zoekt developers. There will be the issue that the flag will be shown
twice for commands, but I will report to the upstream repo about this
problem.
Test Plan: go get -u ./... && go run ./cmd/zoekt-git-index works
This fixes the failing 'fuzz test' check in CI. Example from https://github.com/sourcegraph/zoekt/pull/891:
```
Download action repository 'jidicula/go-fuzz-action@0206b61afc603b665297621fa5e691b1447a5e57' (SHA:0206b61afc603b665297621fa5e691b1447a5e57)
Getting action download info
Error: This request has been automatically failed because it uses a deprecated version of `actions/upload-artifact: v3`. Learn more: https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
```
Usually, if there are candidate matches with overlapping ranges, then we just remove matches that overlap. However, when `opts.ChunkMatches = false`, we had special logic to merge overlapping matches.
This PR removes the overlapping logic to simplify the behavior. I couldn't see a good reason to keep this special handling. Plus, we are moving towards making `ChunkMatches` the default.
Another benefit of this change is that it makes the BM25 behavior easier to understand. If we merged together ranges, then we would be calculating term frequencies for spurious terms (like `new`, `queue`, `newqueue`, `queuenew`, etc.) Note: we currently only use BM25 with `ChunkMatches = true`, so there's not an active bug here.
Currently, BM25 scoring only applies to the overall `FileMatch` score. The
algorithm gathered term frequencies from all candidate matches in the file to
produce a file-level score. However `LineMatch` and `ChunkMatch` scores were
still calculated using the classic Zoekt scoring algorithm.
This PR implements BM25 scoring for `LineMatch` and `ChunkMatch`. It does so by
calculating a BM25 per line. Compared to the classic Zoekt algorithm, this
rewards multiple term matches on a line. Because our term frequency calculation
also boosts symbol matches, the score smoothly balances between "many term
matches" and "interesting term matches".
Now, the code is structured as follows:
* `scoreChunk`: goes through each line in the chunk, calculating its score
through `scoreLine`, and returns the best-scoring line
* `scoreLine`: calculates the score for a single line
The mental model is that "the score of a chunk is always the score of its best
line".
This PR adds a new field `ChunkMatch.BestLineMatch` with the line number of top-scoring line in the chunk. This will let us address a long-standing issue with our new flexible keyword search, where chunk matches can become very large. Since our search results UX only shows the start of a chunk, the most relevant line may not even be visible. With this information on the best line match, we can adjust the search results UX to center the chunk on the most relevant line.
Relates to [SPLF-188](https://linear.app/sourcegraph/issue/SPLF-188/ensure-the-best-scoring-line-match-is-shown-in-search-results)
When digging into our Natural Language Search (NLS) eval results, I found that one of the leading causes for flexible search types like "Fuzzy symbol search" and "Find logic" was noisy matches in top results. Currently, our BM25 ranking rewards any substring match equally. So for queries like 'extract tar', any match on 'tar' (even within unrelated terms like 'start', etc.) counts towards the term frequency.
This PR helps reduce noise by boosting symbol matches the same as we do filename matches. Our NLS evals show positive improvement, and context evals are the tiniest bit better.