Commits
ShardBuilder.Add still determined language before file category for callers
that bypass Builder.Add. In that path skipped content can be replaced with the
not-indexed marker, so doing language first leaves category detection
operating on synthetic content instead of the original file and misses the
cheaper skip-aware language path.
This changes ShardBuilder.Add to determine the file category before rewriting
skipped content, then infer language afterward so direct ShardBuilder callers
follow the same behavior as Builder.Add and keep content-aware categorization
for skipped documents.
Shard merging also reconstructs documents through ShardBuilder, so this PR
carries the category already stored in the source shard into the rebuilt
document. That keeps merge metadata-preserving instead of forcing category
inference to rediscover information the original indexer already knew.
Truncate failure messages to 12 KiB before sending index status updates to Sourcegraph. This limit is modeled after Kubernetes, which similarly truncates termination messages on the assumption that complete failure details are available in container logs.
Add repository state and failure message fields to UpdateIndexStatusRequest.
Reported by "trivy fs go.mod". Resolved by upgrading the minimum
versions:
go get github.com/go-git/go-billy/v5@v5.9.0 github.com/go-git/go-git/v5@v5.19.0
Adds query.Regexp.RegexpString() as the public API for getting the marshaled raw regexp pattern. Existing query.Regexp string, proto and gob serialization paths now route through that method.
The default for zoekt-git-index has always been true for incremental.
However, we originally used to use zoekt-archive-index which defaulted
to false for incremental. This meant that since we stopped using archive
index (5 years ago!) forcing non-incremental indexing has always lead to
incremental indexing.
Bitbucket Cloud and Azure DevOps repositories need host-specific URL templates so indexed search results can link back to commits and file locations in their native web UIs. The Azure file template uses the commit-version prefix because Zoekt stores commit SHAs for branch versions, avoiding a branch-only URL form.
Test Plan: go test ./gitindex -run TestSetTemplates
While searching for a short string of text, it returned some TTF files.
I downloaded one from the source repo and checked it locally with a hex
viewer to confirm. A bit of poking around this logic bug.
I found a second more recent instance of the same bug that seems to be
a copy/paste error.
Main use case is so we can sanity check the built binary.
Co-authored-by: Sourcegraph Egg <egg@sourcegraph.com>
Many of the actions in our workflows were several major versions behind
their current releases. Stale action versions miss security fixes,
runtime improvements, and node runtime upgrades that newer GitHub
runners increasingly require.
This bumps each action to the current latest stable major version while
keeping the existing pinning style (major-version tag for first-party
or well-known actions, full SHA for the third-party fuzz action).
Test Plan: CI on the resulting PR will exercise every updated workflow.
Allow user to set the name via the meta option.
Trivy flagged vulns CVEs in go-git, otel and grpc. We upgraded to the
minimum version which fixes the vulns. For otel we made sure we kept the
version consistent across modules.
We had the same pagination logic repeated 3 times. Factor it out.
In our production sourcegraph.com cluster we are regularly seeing git
grow to over 9GB in size leading to OOMs. Once we disabled this flag the
OOMs went away. For now lets default this feature to off until it is
resolved.
It would be useful to know the query which caused a shard to crash.
If you set ZOEKT_DISABLE_GOGIT_OPTIMIZATION=true then zoekt-git-index
would fail on bare repositories. This was a regression from
a0f5789 Handle git indexing path correctly for Git worktrees
As such we also increase our test coverage to handle bare vs normal vs
worktrees. Additionally we add a test for when we disable the
optimization for bare repos, specifically targetting the regression we
saw.
Previously we hardcoded this value to 1mb (which aligns with what is set
in Sourcegraph). But we might as well set it based on the limit that is
passed in.
Additionally we set the special HTTP headers sourcegraph needs in the
config so that when git double checks if it needs to hydrate an object
it can do it successfully.
At Sourcegraph we do a sparse clone which excludes files based on max
file size. However, cat-file will hydrate in missing objects. So we pass
in the same filter to avoid hydrating in those files.
We recently had some crashes in production while experimenting with
ZOEKT_RE2_THRESHOLD_BYTES and found it hard to consume the stack traces
since they were over multiple lines.
`gitindex` assumed repositories could be opened with the default go-git repo opener. That works for normal clones, but it breaks for Git worktrees because refs and object data are split between the worktree git dir and the shared common git dir.
This change introduces a small `plainOpenRepo` helper that uses `git`.
`PlainOpenWithOptions` with `DetectDotGit` and `EnableDotGitCommonDir`. The helper is then used in the plain-open code paths inside `gitindex`, including config/template loading.
With this change, Zoekt can read Git worktrees through go-git using the same logical repository view as Git itself, instead of failing on missing refs or incomplete repository layouts.
* gitindex: replace go-git blob reading with pipelined git cat-file --batch
Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.
Submodule blobs fall back to the existing go-git createDocument path.
Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
go-git BlobObject (before):
Time: 2.94s Allocs: 685K Memory: 691 MB
cat-file pipelined (after):
Time: 0.60s Allocs: 58K Memory: 276 MB
Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
* gitindex: streaming catfileReader API, skip large blobs without reading
Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:
cr, _ := newCatfileReader(repoDir, ids)
for {
size, missing, err := cr.Next()
if size > maxSize { continue } // auto-skipped, never read
content := make([]byte, size)
io.ReadFull(cr, content)
}
Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.
Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.
Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
* gitindex: harden catfileReader Close, add kill switch and SkipReasonMissing
Address review feedback on PR #1021:
- Make Close() idempotent via sync.Once; kill the git process first
(matching Gitaly's pattern) instead of draining all remaining stdout,
so early termination is fast. Suppress the expected SIGKILL exit error.
Add defer close(writeErr) in the writer goroutine to prevent deadlock
on double-close.
- Change Next() return and pending field from int64 to int, use
strconv.Atoi. Removes casts at all call sites; SizeMax is already int.
- Add SkipReasonMissing for blobs that git cat-file reports as missing,
instead of reusing SkipReasonTooLarge. Missing is unexpected for local
repos (corruption, shallow clone, gc race) so log a warning.
- Extract indexCatfileBlobs() with defer cr.Close(), eliminating four
manual Close() calls on error paths.
- Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the
existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs
fall back to the go-git createDocument path.
- Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason).
- Add 19 hardening tests covering Close lifecycle (double close,
concurrent close, early termination), Read edge cases (partial reads,
1-byte buffer, empty blobs, read-without-next), missing object
sequences, large blob byte precision, and duplicate SHAs.
Benchmarked on kubernetes (29,188 files): no performance regression
(geomean -0.89%, within noise).
* index: reduce GC pressure in posting list construction by 1.8x
Three targeted changes to newSearchableString, the hot path where ~54%
of CPU was spent on runtime memory management (memclr, memmove, madvise,
mapassign):
1. Merge postings and lastOffsets maps into a single map[ngram]*postingList.
Pointer-stored values mean the map is only written when a new ngram is
first seen (~282K writes for kubernetes) rather than on every trigram
occurrence (~169M). This cuts per-trigram map operations from 4 to ~1.
2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate
each posting list to 1024 bytes, reducing slice growth events and
eliminating map rehashing.
3. Pool postingsBuilder instances via sync.Pool on the Builder, so
sequential and parallel shard builds reuse map and slice allocations
across shards instead of re-creating them.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Time: 9.3s → 5.3s (-43%)
Allocs: 901K → 677K (-25%)
B/op: 1358 → 1536 MB (+13%, from larger pre-alloc)
Warm path (pooled reuse across shards):
Time: 9.3s → 5.1s (-45%)
Allocs: 901K → 23K (-97%)
B/op: 1358 → 0.6 MB (-99.96%)
WritePostings: ~155ms, unchanged.
* index: direct-indexed array for ASCII trigrams, 3.7x faster builds
Replace map lookups with a direct-indexed [2M]*postingList array for
ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers).
Since 99%+ of trigrams in source code are pure ASCII, this eliminates
nearly all hash and probe overhead from the hot loop. Non-ASCII
trigrams still fall back to the map.
Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid
utf8.DecodeRune function call overhead on the 95-99% of bytes that
are ASCII.
In writePostings, collect ngrams from both the array and map into a
single sorted slice for writing. The on-disk format is unchanged.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Before (with map opt): 5.3s, 677K allocs, 1536 MB
After: 2.5s, 676K allocs, 1544 MB
Speedup: 2.1x (3.7x vs original baseline)
Warm path (pooled reuse):
Before (with map opt): 5.1s, 23K allocs, 0.6 MB
After: 2.3s, 23K allocs, 0.6 MB
Speedup: 2.2x (4.0x vs original baseline)
WritePostings: ~130ms, unchanged.
The non-ASCII map now holds only ~6K entries (vs ~282K before), since
the vast majority of trigrams are served by the direct array.
* index: fix stale comments after ASCII array introduction
Update three comments that still referenced map-only storage after the
direct-indexed ASCII array was added: postingList doc, estimateNgrams
doc, and reset() doc.
* index: address review feedback from keegancsmith
- Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder)
that returns both content and name builders to the pool and nils the
fields, so any subsequent misuse panics obviously.
- Drop error-path pool returns in newShardBuilder: setRepository errors
are extremely rare (invalid templates or >64 branches), not worth the
code complexity.
- Thread shardMax through newShardBuilder(shardMax int). Callers without
a shard size context (merge, tests, public API) pass 0 for the default
(100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool.
- Switch sort.Slice to slices.SortFunc in writePostings for type safety
and to avoid interface boxing overhead.
* index: sparse ASCII index, reduce initialPostingCap to 64
Address review feedback:
- Add asciiPopulated []uint32 sparse index so reset() and
writePostings iterate only populated slots (~275K) instead
of scanning all 2M. Retains postingList allocations for
pool reuse via len(pl.data)==0 detection on the hot path.
- Reduce initialPostingCap from 1024 to 64. On kubernetes
(282K trigrams), median posting list is 10 bytes and 78%
are under 64. Drops pre-allocation waste from 244 MB to
11 MB. Cold-path B/op: 1558 MB → 1352 MB (-13%).
Adds an optional hybrid regex engine (internal/hybridre2) that transparently
switches between grafana/regexp and wasilibs/go-re2 (RE2 via WebAssembly)
based on file content size. Disabled by default — no behaviour change
without opt-in via ZOEKT_RE2_THRESHOLD_BYTES.
## Motivation
Issue #323 identified regex as the dominant CPU consumer in zoekt's
webserver profile. Go's regexp engine (including the grafana/regexp fork
already in use) lacks a lazy DFA. RE2's lazy DFA provides linear-time
matching with much better constant factors for alternations, character
classes, and complex patterns on large inputs.
The tradeoff: go-re2 uses WebAssembly (~600ns per-call overhead), making
it slower than grafana/regexp for small inputs (<4KB) but dramatically
faster above the threshold. A full engine swap would regress small-file
searches, so a threshold-based hybrid is the pragmatic approach.
## Implementation
### New package: internal/hybridre2
hybridre2.Regexp compiles both engines once at query-parse time and
dispatches FindAllIndex based on len(input) >= Threshold():
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int {
if useRE2(len(b)) {
return re.re2.FindAllIndex(b, n)
}
return re.grafana.FindAllIndex(b, n)
}
### Change to index/matchtree.go
regexpMatchTree gains a hybridRegexp field used for file content matching;
filename matching keeps using grafana/regexp directly (filenames are always
short, so WASM overhead dominates there).
### Configuration
ZOEKT_RE2_THRESHOLD_BYTES env var, read once at startup:
-1 (default): disabled — always grafana/regexp, zero behaviour change
0: always use go-re2 (useful for evaluation/testing)
32768: use go-re2 for files >= 32KB (recommended starting point)
## Benchmarks
Hardware: AMD EPYC 9B14, go-re2 v1.10.0 (WASM, no CGO).
Alternations — `func|var|const|type|import`:
32KB: grafana 2505µs go-re2 467µs 5.4x speedup
128KB: grafana 9900µs go-re2 1699µs 5.8x speedup
512KB: grafana 40.7ms go-re2 6.8ms 6.0x speedup
Complex — `(func|var)\s+[A-Z]\w*\s*(`:
32KB: grafana 1237µs go-re2 230µs 5.4x speedup
128KB: grafana 4935µs go-re2 911µs 5.4x speedup
512KB: grafana 19.9ms go-re2 3.8ms 5.3x speedup
Literal — `main` (grafana wins; threshold protects this case):
32KB: grafana 33.2µs go-re2 59.8µs
## Testing
go test ./internal/hybridre2/ # unit + correctness matrix
go test ./index/ -short # full existing suite: passes
go test ./... -short # full suite: passes
Correctness verified by asserting identical match offsets between grafana
and go-re2 for 9 patterns x 5 sizes (64B-256KB).
## Notes
- Binary/non-UTF-8 content: go-re2 stops at invalid UTF-8 (vs. grafana
which replaces with the replacement character). The default threshold of
-1 ensures zero behaviour change. Operators enabling the threshold should
be aware; future work could detect non-UTF-8 and force the grafana path.
- Dependency: github.com/wasilibs/go-re2 v1.10.0 — pure Go WASM, no system
deps. Binary size increase: ~2MB (the embedded RE2 WASM module).
- Rollout plan: enable in GitLab via feature flag starting at 32KB, compare
p95 regex latency before/after using per-shard timing in search responses.
This supersedes #635 by porting its selective config-sync idea onto
current main with a smaller, easier-to-read shape. Clone orchestration
stays in gitindex/clone.go, while config argument generation and
existing-clone sync logic now live in gitindex/clone_config.go.
For existing clones, we now diff each zoekt.* setting before writing.
Unchanged values are skipped, changed values are updated, and settings
that disappear are removed. CloneRepo returns the repo destination only
when a setting change actually happened so the caller can trigger
reindexing only when needed.
cmd/zoekt-mirror-github was also cleaned up so optional integer metadata
keys are only added to the config map when present, which avoids pushing
empty config values downstream.
Note: On #635 review it mentioned using go-git. This commit initially
explored that but it ended up being a _lot_ more code due to missing
utilities around easily setting values based on a git config string.
The parser used to lift `type:` directives before converting `or` placeholders into real boolean nodes. That let parse-time `orOp` sentinels leak into `Type` children for unparenthesized queries and allowed malformed inputs like `type:repo or` to parse successfully until runtime.
Resolve operator placeholders before wrapping with `Type` so unparenthesized `or` inside a type scope produces a valid `Or` subtree and malformed `or` placement is rejected during parsing. The docs now clarify that `type:` applies to the whole expression in its current scope, including `or` clauses.
Amp-Thread-ID: https://ampcode.com/threads/T-019d0be7-4c9d-76fd-b556-103420d14be9
Co-authored-by: Amp <amp@ampcode.com>
Previously only the "top-level" "case:" affected the query. So if you
had a "case" in any sub expression it would have no effect. This adjusts
our implementation to track nested cases in our query parser and allow
them to affect the final case sensitivity on query.Qs.
Test Plan: added more test cases to demonstrate the problem. These
failed before this commit.
TestDirWatcherUnloadOnce occasionally failed in CI with a queued
"spurious load" event after Stop().
This is expected under fsnotify: a single logical write can emit multiple
filesystem events, and DirectoryWatcher can therefore enqueue extra load
notifications before shutdown completes. The old assertion treated any
leftover load event as a bug, which made the test timing-sensitive.
Update the test to drain queued load notifications after Stop() and keep
asserting that no extra drop event is emitted. That preserves the behavior
we care about (delete triggers unload, and unload is not repeated) while
removing a brittle assumption about load event multiplicity.
Reproduced flake before with:
go test ./search -run TestDirWatcherUnloadOnce -count=500
Validated after change with:
go test ./search -run TestDirWatcherUnloadOnce -count=500
Amp-Thread-ID: https://ampcode.com/threads/T-019d0104-367c-719f-a5f0-e7a95136407c
Co-authored-by: Amp <amp@ampcode.com>
This had likely been copied from the `zoekt-mirror-gerrit` documentation.
Bumped a bunch of dependencies which trivy reported had CVEs to latest:
go get \
github.com/cloudflare/circl@latest \
github.com/go-git/go-git/v5@latest \
go.opentelemetry.io/otel/sdk@latest
go mod tidy
Co-authored-by: Ravi Kumar <rkumar@gitlab.com>
`escape` escapes certain characters, but those don't include `+`, so clicking a `C++` language link will result in the `+`s not being escaped in the URL. This causes the search to search for `"C "` which finds no results.
`encodeURIComponent` is what should be used.
This fixes a bug caused by jsonv2 json.Unmarshal() behavior that is not backward compatible.
The infinite recursion occurs when UnmarshalJSON calls json.Unmarshal on itself.
See: https://github.com/golang/go/issues/75361
Currently the zoekt command's output isn't able to surface information
like which branch a match corresponds to. By streaming out the matches
as lines of json, it can be easily processed by other programs like jq.
This is potentially useful for a lot of cli tools. Personally I'm
looking to use this with fzf and bat to build a little search tui.
- github.com/cloudflare/circl v1.5.0 -> v1.6.1 (fixes CVE-2025-8556)
- golang.org/x/crypto v0.41.0 -> v0.45.0 (fixes CVE-2025-47914, CVE-2025-58181)
Amp-Thread-ID: https://ampcode.com/threads/T-542737d5-98ed-4e85-879f-76abc31d5b77
Co-authored-by: Amp <amp@ampcode.com>
- Replace otelgrpc.StreamServerInterceptor() and otelgrpc.UnaryServerInterceptor() with otelgrpc.NewServerHandler()
- Upgrade go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc from v0.58.0 to v0.63.0
- Stats handler approach is the recommended way forward and more efficient than interceptor-based approach
Amp-Thread-ID: https://ampcode.com/threads/T-2b4e5684-212d-42e5-b790-82569bc9f447
Co-authored-by: Amp <amp@ampcode.com>
I think these snuck in as we changed go versions
Includes one update to prevent some WARN log spam if your data directory
is not backed by a block device.
The /a/ prefix is required for authenticated git operations in default Gerrit
configurations. Gerrit's HttpScheme intentionally includes /a/ in clone URLs
to trigger authentication. Removing it breaks instances using default config.
The ServerInfo API already returns the correct URL format based on the
instance's configuration, so we should use it as-is.
Amp-Thread-ID: https://ampcode.com/threads/T-6944ab20-837d-4e78-a9ad-51083263baec
Co-authored-by: Amp <amp@ampcode.com>
Co-authored-by: John Mason <jmason@gitlab.com>
* Fix git.Open() fs argument for .git dir instead of worktree
The filesystem argument submitted to git.Open() is of the .git
directory, it should be for the worktree, as in go-git this logic is copied from:
https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375
One side effect of this bug was that go-git would refuse to fetch repo
submodules
* Fix missing RepoURLs and LineFragments for result subrepositories
RepoURLs for subrepositories are currently filtered out when results
are streamed per-repo from shards.
* Add submodule indexing without repo cache
When not using repo cache, index git submodules recursively into
subrepos using the go-git API
* Adjust subrepo naming conventions
- Revert repo-cache behavior to unchanged
- Add support for naming repos and modules with no remote (use the dirname for the
former and subrepopath for the latter)
- Add test for submodule no-repo-cache behavior
* Fix overallocation in repoURL and lineFragment filtering
* Revert basename treatment from 08f67597
Missed the fact that it is already done in zoekt-git-index
command, the test was improperly configured
https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100
Co-authored-by: ARyzhikov <ARyzhikov@nota.tech>
We want Zoekt and Sourcegraph to use the same language package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it.
Notes:
- Zoekt doesn't need to fetch content async which is why I added a little helper func `GetLanguagesFromContent` to make the call sites in Zoekt less awkward.
- Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from `unsupportedByEnryExtensionToNameMap`. I added an additional unit test to cover this case.
Test plan:
I appended the test cases from the old Zoekt languages packages to the tests I copied over from Sourcegraph
In https://github.com/sourcegraph/zoekt/pull/962 a new field metadata
was added to zoekt.Repository, but it was not added to the grpc
types and converter methods, which is why the fuzz test occasionally
fails.
Test plan:
CI
ShardBuilder.Add still determined language before file category for callers
that bypass Builder.Add. In that path skipped content can be replaced with the
not-indexed marker, so doing language first leaves category detection
operating on synthetic content instead of the original file and misses the
cheaper skip-aware language path.
This changes ShardBuilder.Add to determine the file category before rewriting
skipped content, then infer language afterward so direct ShardBuilder callers
follow the same behavior as Builder.Add and keep content-aware categorization
for skipped documents.
Shard merging also reconstructs documents through ShardBuilder, so this PR
carries the category already stored in the source shard into the rebuilt
document. That keeps merge metadata-preserving instead of forcing category
inference to rediscover information the original indexer already knew.
The default for zoekt-git-index has always been true for incremental.
However, we originally used to use zoekt-archive-index which defaulted
to false for incremental. This meant that since we stopped using archive
index (5 years ago!) forcing non-incremental indexing has always lead to
incremental indexing.
Bitbucket Cloud and Azure DevOps repositories need host-specific URL templates so indexed search results can link back to commits and file locations in their native web UIs. The Azure file template uses the commit-version prefix because Zoekt stores commit SHAs for branch versions, avoiding a branch-only URL form.
Test Plan: go test ./gitindex -run TestSetTemplates
Many of the actions in our workflows were several major versions behind
their current releases. Stale action versions miss security fixes,
runtime improvements, and node runtime upgrades that newer GitHub
runners increasingly require.
This bumps each action to the current latest stable major version while
keeping the existing pinning style (major-version tag for first-party
or well-known actions, full SHA for the third-party fuzz action).
Test Plan: CI on the resulting PR will exercise every updated workflow.
If you set ZOEKT_DISABLE_GOGIT_OPTIMIZATION=true then zoekt-git-index
would fail on bare repositories. This was a regression from
a0f5789 Handle git indexing path correctly for Git worktrees
As such we also increase our test coverage to handle bare vs normal vs
worktrees. Additionally we add a test for when we disable the
optimization for bare repos, specifically targetting the regression we
saw.
Previously we hardcoded this value to 1mb (which aligns with what is set
in Sourcegraph). But we might as well set it based on the limit that is
passed in.
Additionally we set the special HTTP headers sourcegraph needs in the
config so that when git double checks if it needs to hydrate an object
it can do it successfully.
`gitindex` assumed repositories could be opened with the default go-git repo opener. That works for normal clones, but it breaks for Git worktrees because refs and object data are split between the worktree git dir and the shared common git dir.
This change introduces a small `plainOpenRepo` helper that uses `git`.
`PlainOpenWithOptions` with `DetectDotGit` and `EnableDotGitCommonDir`. The helper is then used in the plain-open code paths inside `gitindex`, including config/template loading.
With this change, Zoekt can read Git worktrees through go-git using the same logical repository view as Git itself, instead of failing on missing refs or incomplete repository layouts.
* gitindex: replace go-git blob reading with pipelined git cat-file --batch
Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.
Submodule blobs fall back to the existing go-git createDocument path.
Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
go-git BlobObject (before):
Time: 2.94s Allocs: 685K Memory: 691 MB
cat-file pipelined (after):
Time: 0.60s Allocs: 58K Memory: 276 MB
Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
* gitindex: streaming catfileReader API, skip large blobs without reading
Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:
cr, _ := newCatfileReader(repoDir, ids)
for {
size, missing, err := cr.Next()
if size > maxSize { continue } // auto-skipped, never read
content := make([]byte, size)
io.ReadFull(cr, content)
}
Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.
Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.
Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
* gitindex: harden catfileReader Close, add kill switch and SkipReasonMissing
Address review feedback on PR #1021:
- Make Close() idempotent via sync.Once; kill the git process first
(matching Gitaly's pattern) instead of draining all remaining stdout,
so early termination is fast. Suppress the expected SIGKILL exit error.
Add defer close(writeErr) in the writer goroutine to prevent deadlock
on double-close.
- Change Next() return and pending field from int64 to int, use
strconv.Atoi. Removes casts at all call sites; SizeMax is already int.
- Add SkipReasonMissing for blobs that git cat-file reports as missing,
instead of reusing SkipReasonTooLarge. Missing is unexpected for local
repos (corruption, shallow clone, gc race) so log a warning.
- Extract indexCatfileBlobs() with defer cr.Close(), eliminating four
manual Close() calls on error paths.
- Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the
existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs
fall back to the go-git createDocument path.
- Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason).
- Add 19 hardening tests covering Close lifecycle (double close,
concurrent close, early termination), Read edge cases (partial reads,
1-byte buffer, empty blobs, read-without-next), missing object
sequences, large blob byte precision, and duplicate SHAs.
Benchmarked on kubernetes (29,188 files): no performance regression
(geomean -0.89%, within noise).
* index: reduce GC pressure in posting list construction by 1.8x
Three targeted changes to newSearchableString, the hot path where ~54%
of CPU was spent on runtime memory management (memclr, memmove, madvise,
mapassign):
1. Merge postings and lastOffsets maps into a single map[ngram]*postingList.
Pointer-stored values mean the map is only written when a new ngram is
first seen (~282K writes for kubernetes) rather than on every trigram
occurrence (~169M). This cuts per-trigram map operations from 4 to ~1.
2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate
each posting list to 1024 bytes, reducing slice growth events and
eliminating map rehashing.
3. Pool postingsBuilder instances via sync.Pool on the Builder, so
sequential and parallel shard builds reuse map and slice allocations
across shards instead of re-creating them.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Time: 9.3s → 5.3s (-43%)
Allocs: 901K → 677K (-25%)
B/op: 1358 → 1536 MB (+13%, from larger pre-alloc)
Warm path (pooled reuse across shards):
Time: 9.3s → 5.1s (-45%)
Allocs: 901K → 23K (-97%)
B/op: 1358 → 0.6 MB (-99.96%)
WritePostings: ~155ms, unchanged.
* index: direct-indexed array for ASCII trigrams, 3.7x faster builds
Replace map lookups with a direct-indexed [2M]*postingList array for
ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers).
Since 99%+ of trigrams in source code are pure ASCII, this eliminates
nearly all hash and probe overhead from the hot loop. Non-ASCII
trigrams still fall back to the map.
Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid
utf8.DecodeRune function call overhead on the 95-99% of bytes that
are ASCII.
In writePostings, collect ngrams from both the array and map into a
single sorted slice for writing. The on-disk format is unchanged.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Before (with map opt): 5.3s, 677K allocs, 1536 MB
After: 2.5s, 676K allocs, 1544 MB
Speedup: 2.1x (3.7x vs original baseline)
Warm path (pooled reuse):
Before (with map opt): 5.1s, 23K allocs, 0.6 MB
After: 2.3s, 23K allocs, 0.6 MB
Speedup: 2.2x (4.0x vs original baseline)
WritePostings: ~130ms, unchanged.
The non-ASCII map now holds only ~6K entries (vs ~282K before), since
the vast majority of trigrams are served by the direct array.
* index: fix stale comments after ASCII array introduction
Update three comments that still referenced map-only storage after the
direct-indexed ASCII array was added: postingList doc, estimateNgrams
doc, and reset() doc.
* index: address review feedback from keegancsmith
- Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder)
that returns both content and name builders to the pool and nils the
fields, so any subsequent misuse panics obviously.
- Drop error-path pool returns in newShardBuilder: setRepository errors
are extremely rare (invalid templates or >64 branches), not worth the
code complexity.
- Thread shardMax through newShardBuilder(shardMax int). Callers without
a shard size context (merge, tests, public API) pass 0 for the default
(100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool.
- Switch sort.Slice to slices.SortFunc in writePostings for type safety
and to avoid interface boxing overhead.
* index: sparse ASCII index, reduce initialPostingCap to 64
Address review feedback:
- Add asciiPopulated []uint32 sparse index so reset() and
writePostings iterate only populated slots (~275K) instead
of scanning all 2M. Retains postingList allocations for
pool reuse via len(pl.data)==0 detection on the hot path.
- Reduce initialPostingCap from 1024 to 64. On kubernetes
(282K trigrams), median posting list is 10 bytes and 78%
are under 64. Drops pre-allocation waste from 244 MB to
11 MB. Cold-path B/op: 1558 MB → 1352 MB (-13%).
Adds an optional hybrid regex engine (internal/hybridre2) that transparently
switches between grafana/regexp and wasilibs/go-re2 (RE2 via WebAssembly)
based on file content size. Disabled by default — no behaviour change
without opt-in via ZOEKT_RE2_THRESHOLD_BYTES.
## Motivation
Issue #323 identified regex as the dominant CPU consumer in zoekt's
webserver profile. Go's regexp engine (including the grafana/regexp fork
already in use) lacks a lazy DFA. RE2's lazy DFA provides linear-time
matching with much better constant factors for alternations, character
classes, and complex patterns on large inputs.
The tradeoff: go-re2 uses WebAssembly (~600ns per-call overhead), making
it slower than grafana/regexp for small inputs (<4KB) but dramatically
faster above the threshold. A full engine swap would regress small-file
searches, so a threshold-based hybrid is the pragmatic approach.
## Implementation
### New package: internal/hybridre2
hybridre2.Regexp compiles both engines once at query-parse time and
dispatches FindAllIndex based on len(input) >= Threshold():
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int {
if useRE2(len(b)) {
return re.re2.FindAllIndex(b, n)
}
return re.grafana.FindAllIndex(b, n)
}
### Change to index/matchtree.go
regexpMatchTree gains a hybridRegexp field used for file content matching;
filename matching keeps using grafana/regexp directly (filenames are always
short, so WASM overhead dominates there).
### Configuration
ZOEKT_RE2_THRESHOLD_BYTES env var, read once at startup:
-1 (default): disabled — always grafana/regexp, zero behaviour change
0: always use go-re2 (useful for evaluation/testing)
32768: use go-re2 for files >= 32KB (recommended starting point)
## Benchmarks
Hardware: AMD EPYC 9B14, go-re2 v1.10.0 (WASM, no CGO).
Alternations — `func|var|const|type|import`:
32KB: grafana 2505µs go-re2 467µs 5.4x speedup
128KB: grafana 9900µs go-re2 1699µs 5.8x speedup
512KB: grafana 40.7ms go-re2 6.8ms 6.0x speedup
Complex — `(func|var)\s+[A-Z]\w*\s*(`:
32KB: grafana 1237µs go-re2 230µs 5.4x speedup
128KB: grafana 4935µs go-re2 911µs 5.4x speedup
512KB: grafana 19.9ms go-re2 3.8ms 5.3x speedup
Literal — `main` (grafana wins; threshold protects this case):
32KB: grafana 33.2µs go-re2 59.8µs
## Testing
go test ./internal/hybridre2/ # unit + correctness matrix
go test ./index/ -short # full existing suite: passes
go test ./... -short # full suite: passes
Correctness verified by asserting identical match offsets between grafana
and go-re2 for 9 patterns x 5 sizes (64B-256KB).
## Notes
- Binary/non-UTF-8 content: go-re2 stops at invalid UTF-8 (vs. grafana
which replaces with the replacement character). The default threshold of
-1 ensures zero behaviour change. Operators enabling the threshold should
be aware; future work could detect non-UTF-8 and force the grafana path.
- Dependency: github.com/wasilibs/go-re2 v1.10.0 — pure Go WASM, no system
deps. Binary size increase: ~2MB (the embedded RE2 WASM module).
- Rollout plan: enable in GitLab via feature flag starting at 32KB, compare
p95 regex latency before/after using per-shard timing in search responses.
This supersedes #635 by porting its selective config-sync idea onto
current main with a smaller, easier-to-read shape. Clone orchestration
stays in gitindex/clone.go, while config argument generation and
existing-clone sync logic now live in gitindex/clone_config.go.
For existing clones, we now diff each zoekt.* setting before writing.
Unchanged values are skipped, changed values are updated, and settings
that disappear are removed. CloneRepo returns the repo destination only
when a setting change actually happened so the caller can trigger
reindexing only when needed.
cmd/zoekt-mirror-github was also cleaned up so optional integer metadata
keys are only added to the config map when present, which avoids pushing
empty config values downstream.
Note: On #635 review it mentioned using go-git. This commit initially
explored that but it ended up being a _lot_ more code due to missing
utilities around easily setting values based on a git config string.
The parser used to lift `type:` directives before converting `or` placeholders into real boolean nodes. That let parse-time `orOp` sentinels leak into `Type` children for unparenthesized queries and allowed malformed inputs like `type:repo or` to parse successfully until runtime.
Resolve operator placeholders before wrapping with `Type` so unparenthesized `or` inside a type scope produces a valid `Or` subtree and malformed `or` placement is rejected during parsing. The docs now clarify that `type:` applies to the whole expression in its current scope, including `or` clauses.
Amp-Thread-ID: https://ampcode.com/threads/T-019d0be7-4c9d-76fd-b556-103420d14be9
Co-authored-by: Amp <amp@ampcode.com>
Previously only the "top-level" "case:" affected the query. So if you
had a "case" in any sub expression it would have no effect. This adjusts
our implementation to track nested cases in our query parser and allow
them to affect the final case sensitivity on query.Qs.
Test Plan: added more test cases to demonstrate the problem. These
failed before this commit.
TestDirWatcherUnloadOnce occasionally failed in CI with a queued
"spurious load" event after Stop().
This is expected under fsnotify: a single logical write can emit multiple
filesystem events, and DirectoryWatcher can therefore enqueue extra load
notifications before shutdown completes. The old assertion treated any
leftover load event as a bug, which made the test timing-sensitive.
Update the test to drain queued load notifications after Stop() and keep
asserting that no extra drop event is emitted. That preserves the behavior
we care about (delete triggers unload, and unload is not repeated) while
removing a brittle assumption about load event multiplicity.
Reproduced flake before with:
go test ./search -run TestDirWatcherUnloadOnce -count=500
Validated after change with:
go test ./search -run TestDirWatcherUnloadOnce -count=500
Amp-Thread-ID: https://ampcode.com/threads/T-019d0104-367c-719f-a5f0-e7a95136407c
Co-authored-by: Amp <amp@ampcode.com>
Currently the zoekt command's output isn't able to surface information
like which branch a match corresponds to. By streaming out the matches
as lines of json, it can be easily processed by other programs like jq.
This is potentially useful for a lot of cli tools. Personally I'm
looking to use this with fzf and bat to build a little search tui.
- Replace otelgrpc.StreamServerInterceptor() and otelgrpc.UnaryServerInterceptor() with otelgrpc.NewServerHandler()
- Upgrade go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc from v0.58.0 to v0.63.0
- Stats handler approach is the recommended way forward and more efficient than interceptor-based approach
Amp-Thread-ID: https://ampcode.com/threads/T-2b4e5684-212d-42e5-b790-82569bc9f447
Co-authored-by: Amp <amp@ampcode.com>
The /a/ prefix is required for authenticated git operations in default Gerrit
configurations. Gerrit's HttpScheme intentionally includes /a/ in clone URLs
to trigger authentication. Removing it breaks instances using default config.
The ServerInfo API already returns the correct URL format based on the
instance's configuration, so we should use it as-is.
Amp-Thread-ID: https://ampcode.com/threads/T-6944ab20-837d-4e78-a9ad-51083263baec
Co-authored-by: Amp <amp@ampcode.com>
* Fix git.Open() fs argument for .git dir instead of worktree
The filesystem argument submitted to git.Open() is of the .git
directory, it should be for the worktree, as in go-git this logic is copied from:
https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375
One side effect of this bug was that go-git would refuse to fetch repo
submodules
* Fix missing RepoURLs and LineFragments for result subrepositories
RepoURLs for subrepositories are currently filtered out when results
are streamed per-repo from shards.
* Add submodule indexing without repo cache
When not using repo cache, index git submodules recursively into
subrepos using the go-git API
* Adjust subrepo naming conventions
- Revert repo-cache behavior to unchanged
- Add support for naming repos and modules with no remote (use the dirname for the
former and subrepopath for the latter)
- Add test for submodule no-repo-cache behavior
* Fix overallocation in repoURL and lineFragment filtering
* Revert basename treatment from 08f67597
Missed the fact that it is already done in zoekt-git-index
command, the test was improperly configured
https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100
We want Zoekt and Sourcegraph to use the same language package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it.
Notes:
- Zoekt doesn't need to fetch content async which is why I added a little helper func `GetLanguagesFromContent` to make the call sites in Zoekt less awkward.
- Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from `unsupportedByEnryExtensionToNameMap`. I added an additional unit test to cover this case.
Test plan:
I appended the test cases from the old Zoekt languages packages to the tests I copied over from Sourcegraph