gitindex: replace go-git blob reading with pipelined git cat-file --batch (#1021)
* gitindex: replace go-git blob reading with pipelined git cat-file --batch
Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.
Submodule blobs fall back to the existing go-git createDocument path.
Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
go-git BlobObject (before):
Time: 2.94s Allocs: 685K Memory: 691 MB
cat-file pipelined (after):
Time: 0.60s Allocs: 58K Memory: 276 MB
Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
* gitindex: streaming catfileReader API, skip large blobs without reading
Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:
cr, _ := newCatfileReader(repoDir, ids)
for {
size, missing, err := cr.Next()
if size > maxSize { continue } // auto-skipped, never read
content := make([]byte, size)
io.ReadFull(cr, content)
}
Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.
Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.
Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
* gitindex: harden catfileReader Close, add kill switch and SkipReasonMissing
Address review feedback on PR #1021:
- Make Close() idempotent via sync.Once; kill the git process first
(matching Gitaly's pattern) instead of draining all remaining stdout,
so early termination is fast. Suppress the expected SIGKILL exit error.
Add defer close(writeErr) in the writer goroutine to prevent deadlock
on double-close.
- Change Next() return and pending field from int64 to int, use
strconv.Atoi. Removes casts at all call sites; SizeMax is already int.
- Add SkipReasonMissing for blobs that git cat-file reports as missing,
instead of reusing SkipReasonTooLarge. Missing is unexpected for local
repos (corruption, shallow clone, gc race) so log a warning.
- Extract indexCatfileBlobs() with defer cr.Close(), eliminating four
manual Close() calls on error paths.
- Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the
existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs
fall back to the go-git createDocument path.
- Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason).
- Add 19 hardening tests covering Close lifecycle (double close,
concurrent close, early termination), Read edge cases (partial reads,
1-byte buffer, empty blobs, read-without-next), missing object
sequences, large blob byte precision, and duplicate SHAs.
Benchmarked on kubernetes (29,188 files): no performance regression
(geomean -0.89%, within noise).
gitindex: replace go-git blob reading with pipelined git cat-file --batch (#1021)
* gitindex: replace go-git blob reading with pipelined git cat-file --batch
Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.
Submodule blobs fall back to the existing go-git createDocument path.
Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
go-git BlobObject (before):
Time: 2.94s Allocs: 685K Memory: 691 MB
cat-file pipelined (after):
Time: 0.60s Allocs: 58K Memory: 276 MB
Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
* gitindex: streaming catfileReader API, skip large blobs without reading
Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:
cr, _ := newCatfileReader(repoDir, ids)
for {
size, missing, err := cr.Next()
if size > maxSize { continue } // auto-skipped, never read
content := make([]byte, size)
io.ReadFull(cr, content)
}
Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.
Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.
Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
* gitindex: harden catfileReader Close, add kill switch and SkipReasonMissing
Address review feedback on PR #1021:
- Make Close() idempotent via sync.Once; kill the git process first
(matching Gitaly's pattern) instead of draining all remaining stdout,
so early termination is fast. Suppress the expected SIGKILL exit error.
Add defer close(writeErr) in the writer goroutine to prevent deadlock
on double-close.
- Change Next() return and pending field from int64 to int, use
strconv.Atoi. Removes casts at all call sites; SizeMax is already int.
- Add SkipReasonMissing for blobs that git cat-file reports as missing,
instead of reusing SkipReasonTooLarge. Missing is unexpected for local
repos (corruption, shallow clone, gc race) so log a warning.
- Extract indexCatfileBlobs() with defer cr.Close(), eliminating four
manual Close() calls on error paths.
- Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the
existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs
fall back to the go-git createDocument path.
- Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason).
- Add 19 hardening tests covering Close lifecycle (double close,
concurrent close, early termination), Read edge cases (partial reads,
1-byte buffer, empty blobs, read-without-next), missing object
sequences, large blob byte precision, and duplicate SHAs.
Benchmarked on kubernetes (29,188 files): no performance regression
(geomean -0.89%, within noise).
index: 3.7x faster posting list construction via direct-indexed ASCII array (#1020)
* index: reduce GC pressure in posting list construction by 1.8x
Three targeted changes to newSearchableString, the hot path where ~54%
of CPU was spent on runtime memory management (memclr, memmove, madvise,
mapassign):
1. Merge postings and lastOffsets maps into a single map[ngram]*postingList.
Pointer-stored values mean the map is only written when a new ngram is
first seen (~282K writes for kubernetes) rather than on every trigram
occurrence (~169M). This cuts per-trigram map operations from 4 to ~1.
2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate
each posting list to 1024 bytes, reducing slice growth events and
eliminating map rehashing.
3. Pool postingsBuilder instances via sync.Pool on the Builder, so
sequential and parallel shard builds reuse map and slice allocations
across shards instead of re-creating them.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Time: 9.3s → 5.3s (-43%)
Allocs: 901K → 677K (-25%)
B/op: 1358 → 1536 MB (+13%, from larger pre-alloc)
Warm path (pooled reuse across shards):
Time: 9.3s → 5.1s (-45%)
Allocs: 901K → 23K (-97%)
B/op: 1358 → 0.6 MB (-99.96%)
WritePostings: ~155ms, unchanged.
* index: direct-indexed array for ASCII trigrams, 3.7x faster builds
Replace map lookups with a direct-indexed [2M]*postingList array for
ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers).
Since 99%+ of trigrams in source code are pure ASCII, this eliminates
nearly all hash and probe overhead from the hot loop. Non-ASCII
trigrams still fall back to the map.
Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid
utf8.DecodeRune function call overhead on the 95-99% of bytes that
are ASCII.
In writePostings, collect ngrams from both the array and map into a
single sorted slice for writing. The on-disk format is unchanged.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Before (with map opt): 5.3s, 677K allocs, 1536 MB
After: 2.5s, 676K allocs, 1544 MB
Speedup: 2.1x (3.7x vs original baseline)
Warm path (pooled reuse):
Before (with map opt): 5.1s, 23K allocs, 0.6 MB
After: 2.3s, 23K allocs, 0.6 MB
Speedup: 2.2x (4.0x vs original baseline)
WritePostings: ~130ms, unchanged.
The non-ASCII map now holds only ~6K entries (vs ~282K before), since
the vast majority of trigrams are served by the direct array.
* index: fix stale comments after ASCII array introduction
Update three comments that still referenced map-only storage after the
direct-indexed ASCII array was added: postingList doc, estimateNgrams
doc, and reset() doc.
* index: address review feedback from keegancsmith
- Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder)
that returns both content and name builders to the pool and nils the
fields, so any subsequent misuse panics obviously.
- Drop error-path pool returns in newShardBuilder: setRepository errors
are extremely rare (invalid templates or >64 branches), not worth the
code complexity.
- Thread shardMax through newShardBuilder(shardMax int). Callers without
a shard size context (merge, tests, public API) pass 0 for the default
(100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool.
- Switch sort.Slice to slices.SortFunc in writePostings for type safety
and to avoid interface boxing overhead.
* index: sparse ASCII index, reduce initialPostingCap to 64
Address review feedback:
- Add asciiPopulated []uint32 sparse index so reset() and
writePostings iterate only populated slots (~275K) instead
of scanning all 2M. Retains postingList allocations for
pool reuse via len(pl.data)==0 detection on the hot path.
- Reduce initialPostingCap from 1024 to 64. On kubernetes
(282K trigrams), median posting list is 10 bytes and 78%
are under 64. Drops pre-allocation waste from 244 MB to
11 MB. Cold-path B/op: 1558 MB → 1352 MB (-13%).
index: 3.7x faster posting list construction via direct-indexed ASCII array (#1020)
* index: reduce GC pressure in posting list construction by 1.8x
Three targeted changes to newSearchableString, the hot path where ~54%
of CPU was spent on runtime memory management (memclr, memmove, madvise,
mapassign):
1. Merge postings and lastOffsets maps into a single map[ngram]*postingList.
Pointer-stored values mean the map is only written when a new ngram is
first seen (~282K writes for kubernetes) rather than on every trigram
occurrence (~169M). This cuts per-trigram map operations from 4 to ~1.
2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate
each posting list to 1024 bytes, reducing slice growth events and
eliminating map rehashing.
3. Pool postingsBuilder instances via sync.Pool on the Builder, so
sequential and parallel shard builds reuse map and slice allocations
across shards instead of re-creating them.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Time: 9.3s → 5.3s (-43%)
Allocs: 901K → 677K (-25%)
B/op: 1358 → 1536 MB (+13%, from larger pre-alloc)
Warm path (pooled reuse across shards):
Time: 9.3s → 5.1s (-45%)
Allocs: 901K → 23K (-97%)
B/op: 1358 → 0.6 MB (-99.96%)
WritePostings: ~155ms, unchanged.
* index: direct-indexed array for ASCII trigrams, 3.7x faster builds
Replace map lookups with a direct-indexed [2M]*postingList array for
ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers).
Since 99%+ of trigrams in source code are pure ASCII, this eliminates
nearly all hash and probe overhead from the hot loop. Non-ASCII
trigrams still fall back to the map.
Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid
utf8.DecodeRune function call overhead on the 95-99% of bytes that
are ASCII.
In writePostings, collect ngrams from both the array and map into a
single sorted slice for writing. The on-disk format is unchanged.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Before (with map opt): 5.3s, 677K allocs, 1536 MB
After: 2.5s, 676K allocs, 1544 MB
Speedup: 2.1x (3.7x vs original baseline)
Warm path (pooled reuse):
Before (with map opt): 5.1s, 23K allocs, 0.6 MB
After: 2.3s, 23K allocs, 0.6 MB
Speedup: 2.2x (4.0x vs original baseline)
WritePostings: ~130ms, unchanged.
The non-ASCII map now holds only ~6K entries (vs ~282K before), since
the vast majority of trigrams are served by the direct array.
* index: fix stale comments after ASCII array introduction
Update three comments that still referenced map-only storage after the
direct-indexed ASCII array was added: postingList doc, estimateNgrams
doc, and reset() doc.
* index: address review feedback from keegancsmith
- Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder)
that returns both content and name builders to the pool and nils the
fields, so any subsequent misuse panics obviously.
- Drop error-path pool returns in newShardBuilder: setRepository errors
are extremely rare (invalid templates or >64 branches), not worth the
code complexity.
- Thread shardMax through newShardBuilder(shardMax int). Callers without
a shard size context (merge, tests, public API) pass 0 for the default
(100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool.
- Switch sort.Slice to slices.SortFunc in writePostings for type safety
and to avoid interface boxing overhead.
* index: sparse ASCII index, reduce initialPostingCap to 64
Address review feedback:
- Add asciiPopulated []uint32 sparse index so reset() and
writePostings iterate only populated slots (~275K) instead
of scanning all 2M. Retains postingList allocations for
pool reuse via len(pl.data)==0 detection on the hot path.
- Reduce initialPostingCap from 1024 to 64. On kubernetes
(282K trigrams), median posting list is 10 bytes and 78%
are under 64. Drops pre-allocation waste from 244 MB to
11 MB. Cold-path B/op: 1558 MB → 1352 MB (-13%).