gitindex: replace go-git blob reading with pipelined git cat-file --batch (#1021)
* gitindex: replace go-git blob reading with pipelined git cat-file --batch
Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.
Submodule blobs fall back to the existing go-git createDocument path.
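
The writer/reader split described above can be sketched as follows. This is a minimal illustration, not the code in this change: pipeline, blob, and fakeCatFile are hypothetical names, and fakeCatFile is an in-memory stand-in that mimics the "<id> <type> <size>" and "<id> missing" response framing so the sketch runs without a repository. With real git, the two streams would be the stdin and stdout pipes of the cat-file subprocess.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// blob is one response read from the batch stream (hypothetical type).
type blob struct {
	id      string
	missing bool
	content []byte
}

// pipeline writes every id to w from a separate goroutine while the calling
// goroutine reads the responses from r, so requests and responses overlap.
func pipeline(w io.WriteCloser, r io.Reader, ids []string) ([]blob, error) {
	writeErr := make(chan error, 1)
	go func() {
		bw := bufio.NewWriter(w)
		for _, id := range ids {
			bw.WriteString(id + "\n") // bufio.Writer errors are sticky; Flush reports them
		}
		err := bw.Flush()
		// Closing stdin tells the peer that no more requests follow.
		if cerr := w.Close(); err == nil {
			err = cerr
		}
		writeErr <- err
	}()

	br := bufio.NewReader(r)
	out := make([]blob, 0, len(ids))
	for range ids {
		header, err := br.ReadString('\n')
		if err != nil {
			return nil, err
		}
		fields := strings.Fields(header)
		if len(fields) == 2 && fields[1] == "missing" {
			out = append(out, blob{id: fields[0], missing: true})
			continue
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("unexpected header %q", header)
		}
		size, err := strconv.Atoi(fields[2])
		if err != nil {
			return nil, err
		}
		content := make([]byte, size)
		if _, err := io.ReadFull(br, content); err != nil {
			return nil, err
		}
		if _, err := br.Discard(1); err != nil { // LF that follows the content
			return nil, err
		}
		out = append(out, blob{id: fields[0], content: content})
	}
	return out, <-writeErr
}

// fakeCatFile is an in-memory stand-in for the subprocess (assumption: it
// only imitates the response framing, nothing else about git).
func fakeCatFile(objects map[string]string) (io.WriteCloser, io.Reader) {
	inR, inW := io.Pipe()
	outR, outW := io.Pipe()
	go func() {
		sc := bufio.NewScanner(inR)
		for sc.Scan() {
			id := sc.Text()
			if data, ok := objects[id]; ok {
				fmt.Fprintf(outW, "%s blob %d\n%s\n", id, len(data), data)
			} else {
				fmt.Fprintf(outW, "%s missing\n", id)
			}
		}
		outW.Close()
	}()
	return inW, outR
}

func main() {
	stdin, stdout := fakeCatFile(map[string]string{"aaa": "hello", "bbb": ""})
	blobs, err := pipeline(stdin, stdout, []string{"aaa", "nope", "bbb"})
	if err != nil {
		panic(err)
	}
	for _, b := range blobs {
		fmt.Printf("%s missing=%v size=%d\n", b.id, b.missing, len(b.content))
	}
}
```

The sketch keeps every blob in memory for brevity and does not stop the writer goroutine when the reader returns early; a production reader would need to handle both.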
Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
                                Time    Allocs   Memory
  go-git BlobObject (before)    2.94s   685K     691 MB
  cat-file pipelined (after)    0.60s   58K      276 MB
Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
* gitindex: streaming catfileReader API, skip large blobs without reading
Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:
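    cr, err := newCatfileReader(repoDir, ids)
    if err != nil { return err }
    defer cr.Close()
    for {
        size, missing, err := cr.Next()
        if err != nil { break } // end of input or a real error
        if missing || size > maxSize { continue } // skipped, never read
        content := make([]byte, size)
        if _, err := io.ReadFull(cr, content); err != nil { return err }
    }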
Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it: calling Next()
again automatically discards any unread bytes via bufio.Reader.Discard.
Blobs larger than SizeMax are never allocated or read into Go memory.
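
A minimal sketch of the Next/Read/Discard contract, assuming a simplified reader over any io.Reader. streamReader and collect are hypothetical names used only for illustration, and the real catfileReader may differ in its details. The pending counter tracks the unread bytes of the current entry, including the LF that follows the content, so a later Next() can discard them without allocating.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// streamReader is a hypothetical, simplified stand-in for catfileReader.
type streamReader struct {
	br      *bufio.Reader
	pending int // unread bytes of the current entry, including its trailing LF
}

func newStreamReader(r io.Reader) *streamReader {
	return &streamReader{br: bufio.NewReader(r)}
}

// Next discards whatever is left of the current entry, then parses the next
// header. It returns io.EOF once the stream is exhausted.
func (s *streamReader) Next() (size int, missing bool, err error) {
	if s.pending > 0 {
		if _, err := s.br.Discard(s.pending); err != nil {
			return 0, false, err
		}
		s.pending = 0
	}
	header, err := s.br.ReadString('\n')
	if err != nil {
		return 0, false, err
	}
	fields := strings.Fields(header)
	if len(fields) == 2 && fields[1] == "missing" {
		return 0, true, nil
	}
	if len(fields) != 3 {
		return 0, false, fmt.Errorf("unexpected header %q", header)
	}
	size, err = strconv.Atoi(fields[2])
	if err != nil {
		return 0, false, err
	}
	s.pending = size + 1
	return size, false, nil
}

// Read returns content bytes of the current entry, never the trailing LF.
func (s *streamReader) Read(p []byte) (int, error) {
	if s.pending <= 1 {
		return 0, io.EOF
	}
	if len(p) > s.pending-1 {
		p = p[:s.pending-1]
	}
	n, err := s.br.Read(p)
	s.pending -= n
	return n, err
}

// collect returns the contents of all blobs no larger than maxSize. Larger
// blobs are skipped without allocating a buffer for them.
func collect(stream string, maxSize int) ([]string, error) {
	sr := newStreamReader(strings.NewReader(stream))
	var kept []string
	for {
		size, missing, err := sr.Next()
		if err == io.EOF {
			return kept, nil
		}
		if err != nil {
			return nil, err
		}
		if missing || size > maxSize {
			continue
		}
		content := make([]byte, size)
		if _, err := io.ReadFull(sr, content); err != nil {
			return nil, err
		}
		kept = append(kept, string(content))
	}
}

func main() {
	stream := "a blob 5\nhello\nb blob 11\nhello world\nc missing\nd blob 2\nhi\n"
	kept, err := collect(stream, 5)
	fmt.Println(kept, err) // prints: [hello hi] <nil>
}
```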
Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.
Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
* gitindex: harden catfileReader Close, add kill switch and SkipReasonMissing
Address review feedback on PR #1021:
- Make Close() idempotent via sync.Once; kill the git process first
(matching Gitaly's pattern) instead of draining all remaining stdout,
so early termination is fast. Suppress the expected SIGKILL exit error.
Add defer close(writeErr) in the writer goroutine to prevent deadlock
on double-close.
- Change the size returned by Next() and the pending field from int64 to
int, parsed with strconv.Atoi. This removes casts at all call sites,
since SizeMax is already an int.
- Add SkipReasonMissing for blobs that git cat-file reports as missing,
instead of reusing SkipReasonTooLarge. Missing is unexpected for local
repos (corruption, shallow clone, gc race) so log a warning.
- Extract indexCatfileBlobs() with defer cr.Close(), eliminating four
manual Close() calls on error paths.
- Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the
existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs
fall back to the go-git createDocument path.
- Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason).
- Add 19 hardening tests covering Close lifecycle (double close,
concurrent close, early termination), Read edge cases (partial reads,
1-byte buffer, empty blobs, read-without-next), missing object
sequences, large blob byte precision, and duplicate SHAs.
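
The Close() behaviour from the first item above (idempotent, kill before wait, expected kill error suppressed) could look roughly like the sketch below. It assumes a small process interface in place of the real subprocess handle; batchReader, process, fakeProcess, and errKilled are hypothetical names, and the real implementation would inspect the subprocess exit error rather than a sentinel.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errKilled stands in for the exit error a killed subprocess reports.
var errKilled = errors.New("signal: killed")

// process is the part of a subprocess handle that Close needs (assumption).
type process interface {
	Kill() error
	Wait() error
}

type batchReader struct {
	proc      process
	closeOnce sync.Once
	closeErr  error
}

// Close kills the process first so that it returns promptly even when output
// is still unread, then waits for it. Repeated and concurrent calls run the
// shutdown once and all return the same result.
func (r *batchReader) Close() error {
	r.closeOnce.Do(func() {
		_ = r.proc.Kill()
		// The kill we just issued is expected, so do not report it.
		if err := r.proc.Wait(); err != nil && !errors.Is(err, errKilled) {
			r.closeErr = err
		}
	})
	return r.closeErr
}

// fakeProcess counts Kill calls so the example runs without a subprocess.
type fakeProcess struct {
	mu    sync.Mutex
	kills int
}

func (p *fakeProcess) Kill() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.kills++
	return nil
}

func (p *fakeProcess) Wait() error { return errKilled }

func main() {
	p := &fakeProcess{}
	r := &batchReader{proc: p}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = r.Close()
		}()
	}
	wg.Wait()
	fmt.Println(p.kills, r.Close()) // prints: 1 <nil>
}
```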
Benchmarked on kubernetes (29,188 files): no performance regression
(geomean -0.89%, within noise).