zoekt/index/ at tngl · boltless.me/zoekt · Tangled

boltless.me / zoekt

0

fork of https://github.com/sourcegraph/zoekt

0

zoekt / index /

at tngl 42 files

Seongmin Lee use repo URL as an identifier 13hrs ago

use repo URL as an identifier Signed-off-by: Seongmin Lee <git@boltless.me>

13 hours ago

Fix compilation on 32 bit architectures (#936) This PR fixes a bug where Zoekt would not compile on 32-bit architectures. It also takes the opportunity to start using the `math` library everywhere instead of our own constants like `maxUInt32` to help prevent this sort of issue in the future by encouraging devs to select the most accurate "max" type for their specific situation. Closes https://github.com/sourcegraph/zoekt/issues/935

1 year ago

all: run modernize across codebase (#919) The latest release of gopls has a feature called modernize which will update your code where it can to use modern go features/pkgs. https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0 Generated with: go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./... Test Plan: CI

1 year ago

all: run modernize across codebase (#919) The latest release of gopls has a feature called modernize which will update your code where it can to use modern go features/pkgs. https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0 Generated with: go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./... Test Plan: CI

1 year ago

gitindex: optimize git index time by ~21% (#1036)

2 months ago

builder_test.go

gitindex: optimize git index time by ~21% (#1036)

2 months ago

contentprovider.go

all: run modernize across codebase (#919) The latest release of gopls has a feature called modernize which will update your code where it can to use modern go features/pkgs. https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0 Generated with: go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./... Test Plan: CI

1 year ago

contentprovider_test.go

Move root-level index code to index package (#902) In the repo root, we have a bunch of low level logic around index building and searching. So we end up exposing internal logic through the main public `zoekt` package, for example `zoekt.Merge(...)`. This PR moves it into the `build` package, so all code related to index building lives together. It then renames `build` to `index` to reflect the broader focus on indexing and searching the index.

1 year ago

gitindex: optimize git index time by ~21% (#1036)

2 months ago

Move root-level index code to index package (#902) In the repo root, we have a bunch of low level logic around index building and searching. So we end up exposing internal logic through the main public `zoekt` package, for example `zoekt.Merge(...)`. This PR moves it into the `build` package, so all code related to index building lives together. It then renames `build` to `index` to reflect the broader focus on indexing and searching the index.

1 year ago

docmatchtreecache.go

Cache docMatchTree results for Meta conditions (#982)

9 months ago

docmatchtreecache_test.go

Cache docMatchTree results for Meta conditions (#982)

9 months ago

gitindex: replace go-git blob reading with pipelined git cat-file --batch (#1021) * gitindex: replace go-git blob reading with pipelined git cat-file --batch Replace the serial go-git BlobObject calls in indexGitRepo with a single pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine feeds all blob SHAs to stdin while the main goroutine reads responses from stdout, forming a concurrent pipeline that eliminates per-object packfile seek overhead and leverages git's internal delta base cache. Submodule blobs fall back to the existing go-git createDocument path. Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs: go-git BlobObject (before): Time: 2.94s Allocs: 685K Memory: 691 MB cat-file pipelined (after): Time: 0.60s Allocs: 58K Memory: 276 MB Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory * gitindex: streaming catfileReader API, skip large blobs without reading Replace the bulk readBlobsPipelined (which read all blobs into a []blobResult slice) with a streaming catfileReader modeled after archive/tar.Reader: cr, _ := newCatfileReader(repoDir, ids) for { size, missing, err := cr.Next() if size > maxSize { continue } // auto-skipped, never read content := make([]byte, size) io.ReadFull(cr, content) } Next() reads the cat-file header and returns the blob's size. The caller decides whether to Read the content or skip it — calling Next() again automatically discards unread bytes via bufio.Reader.Discard. Large blobs over SizeMax are never allocated or read into Go memory. Also split the single interleaved loop into two: one for main-repo blobs streamed via cat-file, one for submodule blobs via go-git's createDocument. The builder sorts documents internally so ordering between the loops does not matter. Peak memory is now bounded by ShardMax (one shard's worth of content) rather than total repository size. * gitindex: harden catfileReader Close, add kill switch and SkipReasonMissing Address review feedback on PR #1021: - Make Close() idempotent via sync.Once; kill the git process first (matching Gitaly's pattern) instead of draining all remaining stdout, so early termination is fast. Suppress the expected SIGKILL exit error. Add defer close(writeErr) in the writer goroutine to prevent deadlock on double-close. - Change Next() return and pending field from int64 to int, use strconv.Atoi. Removes casts at all call sites; SizeMax is already int. - Add SkipReasonMissing for blobs that git cat-file reports as missing, instead of reusing SkipReasonTooLarge. Missing is unexpected for local repos (corruption, shallow clone, gc race) so log a warning. - Extract indexCatfileBlobs() with defer cr.Close(), eliminating four manual Close() calls on error paths. - Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs fall back to the go-git createDocument path. - Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason). - Add 19 hardening tests covering Close lifecycle (double close, concurrent close, early termination), Read edge cases (partial reads, 1-byte buffer, empty blobs, read-without-next), missing object sequences, large blob byte precision, and duplicate SHAs. Benchmarked on kubernetes (29,188 files): no performance regression (geomean -0.89%, within noise).

3 months ago

use repo URL as an identifier Signed-off-by: Seongmin Lee <git@boltless.me>

13 hours ago

Add support for indexing and searching custom fields for repositories (#962) At GitLab, we encountered limitations when searching within large namespaces containing thousands of repositories. Specifically, we cannot pass a complete list of RepoIDs due to size constraints. This change introduces support for indexing and searching on custom repository metadata by extending Repository to include an additional Metadata field. All fields within Repository.Metadata are searchable using a regular expression evaluator. This enables more scalable filtering by allowing clients to express regular expression prefix queries on metadata fields, such as: traversal_ids:123-456-.* Or any field really: haystack:nee.*le

1 year ago

file_category.go

gitindex: replace go-git blob reading with pipelined git cat-file --batch (#1021) * gitindex: replace go-git blob reading with pipelined git cat-file --batch Replace the serial go-git BlobObject calls in indexGitRepo with a single pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine feeds all blob SHAs to stdin while the main goroutine reads responses from stdout, forming a concurrent pipeline that eliminates per-object packfile seek overhead and leverages git's internal delta base cache. Submodule blobs fall back to the existing go-git createDocument path. Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs: go-git BlobObject (before): Time: 2.94s Allocs: 685K Memory: 691 MB cat-file pipelined (after): Time: 0.60s Allocs: 58K Memory: 276 MB Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory * gitindex: streaming catfileReader API, skip large blobs without reading Replace the bulk readBlobsPipelined (which read all blobs into a []blobResult slice) with a streaming catfileReader modeled after archive/tar.Reader: cr, _ := newCatfileReader(repoDir, ids) for { size, missing, err := cr.Next() if size > maxSize { continue } // auto-skipped, never read content := make([]byte, size) io.ReadFull(cr, content) } Next() reads the cat-file header and returns the blob's size. The caller decides whether to Read the content or skip it — calling Next() again automatically discards unread bytes via bufio.Reader.Discard. Large blobs over SizeMax are never allocated or read into Go memory. Also split the single interleaved loop into two: one for main-repo blobs streamed via cat-file, one for submodule blobs via go-git's createDocument. The builder sorts documents internally so ordering between the loops does not matter. Peak memory is now bounded by ShardMax (one shard's worth of content) rather than total repository size. * gitindex: harden catfileReader Close, add kill switch and SkipReasonMissing Address review feedback on PR #1021: - Make Close() idempotent via sync.Once; kill the git process first (matching Gitaly's pattern) instead of draining all remaining stdout, so early termination is fast. Suppress the expected SIGKILL exit error. Add defer close(writeErr) in the writer goroutine to prevent deadlock on double-close. - Change Next() return and pending field from int64 to int, use strconv.Atoi. Removes casts at all call sites; SizeMax is already int. - Add SkipReasonMissing for blobs that git cat-file reports as missing, instead of reusing SkipReasonTooLarge. Missing is unexpected for local repos (corruption, shallow clone, gc race) so log a warning. - Extract indexCatfileBlobs() with defer cr.Close(), eliminating four manual Close() calls on error paths. - Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs fall back to the go-git createDocument path. - Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason). - Add 19 hardening tests covering Close lifecycle (double close, concurrent close, early termination), Read edge cases (partial reads, 1-byte buffer, empty blobs, read-without-next), missing object sequences, large blob byte precision, and duplicate SHAs. Benchmarked on kubernetes (29,188 files): no performance regression (geomean -0.89%, within noise).

3 months ago

file_category_test.go

ranking: incorporate file signals into BM25F (#922) This PR reworks the way we incorporate file signals into BM25. Previously, we were applying them as a tie-breaker. But in dogfooding, we found that these rarely impact results, because it's so rare to have a tie in BM25 scores. Now, we take the file signal into account when computing BM25F. The interpretation is that this data lives in a separate 'field' that is half the priority of regular content.

1 year ago

Fix compilation on 32 bit architectures (#936) This PR fixes a bug where Zoekt would not compile on 32-bit architectures. It also takes the opportunity to start using the `math` library everywhere instead of our own constants like `maxUInt32` to help prevent this sort of issue in the future by encouraging devs to select the most accurate "max" type for their specific situation. Closes https://github.com/sourcegraph/zoekt/issues/935

1 year ago

hititer_test.go

refactor(all): goimports -w -local github.com/sourcegraph/zoekt (#948)

1 year ago

use repo URL as an identifier Signed-off-by: Seongmin Lee <git@boltless.me>

13 hours ago

Cache docMatchTree results for Meta conditions (#982)

9 months ago

indexdata_test.go

Move root-level index code to index package (#902) In the repo root, we have a bunch of low level logic around index building and searching. So we end up exposing internal logic through the main public `zoekt` package, for example `zoekt.Merge(...)`. This PR moves it into the `build` package, so all code related to index building lives together. It then renames `build` to `index` to reflect the broader focus on indexing and searching the index.

1 year ago

Allow building on FreeBSD (#968) This fixes https://github.com/sourcegraph/zoekt/issues/967. I reported that also here: https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/issues/91

11 months ago

Move root-level index code to index package (#902) In the repo root, we have a bunch of low level logic around index building and searching. So we end up exposing internal logic through the main public `zoekt` package, for example `zoekt.Merge(...)`. This PR moves it into the `build` package, so all code related to index building lives together. It then renames `build` to `index` to reflect the broader focus on indexing and searching the index.

1 year ago

refactor(all): goimports -w -local github.com/sourcegraph/zoekt (#948)

1 year ago

Fix compilation on 32 bit architectures (#936) This PR fixes a bug where Zoekt would not compile on 32-bit architectures. It also takes the opportunity to start using the `math` library everywhere instead of our own constants like `maxUInt32` to help prevent this sort of issue in the future by encouraging devs to select the most accurate "max" type for their specific situation. Closes https://github.com/sourcegraph/zoekt/issues/935

1 year ago

matchiter_test.go

Move root-level index code to index package (#902) In the repo root, we have a bunch of low level logic around index building and searching. So we end up exposing internal logic through the main public `zoekt` package, for example `zoekt.Merge(...)`. This PR moves it into the `build` package, so all code related to index building lives together. It then renames `build` to `index` to reflect the broader focus on indexing and searching the index.

1 year ago

use repo URL as an identifier Signed-off-by: Seongmin Lee <git@boltless.me>

13 hours ago

matchtree_test.go

Cache docMatchTree results for Meta conditions (#982)

9 months ago

fix/index: preserve skipped-file category through shard paths (#1073) ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path skipped content can be replaced with the not-indexed marker, so doing language first leaves category detection operating on synthetic content instead of the original file and misses the cheaper skip-aware language path. This changes ShardBuilder.Add to determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents. Shard merging also reconstructs documents through ShardBuilder, so this PR carries the category already stored in the source shard into the rebuilt document. That keeps merge metadata-preserving instead of forcing category inference to rediscover information the original indexer already knew.

1 week ago

fix/index: preserve skipped-file category through shard paths (#1073) ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path skipped content can be replaced with the not-indexed marker, so doing language first leaves category detection operating on synthetic content instead of the original file and misses the cheaper skip-aware language path. This changes ShardBuilder.Add to determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents. Shard merging also reconstructs documents through ShardBuilder, so this PR carries the category already stored in the source shard into the rebuilt document. That keeps merge metadata-preserving instead of forcing category inference to rediscover information the original indexer already knew.

1 week ago

postings_bench_test.go

index: 3.7x faster posting list construction via direct-indexed ASCII array (#1020) * index: reduce GC pressure in posting list construction by 1.8x Three targeted changes to newSearchableString, the hot path where ~54% of CPU was spent on runtime memory management (memclr, memmove, madvise, mapassign): 1. Merge postings and lastOffsets maps into a single map[ngram]*postingList. Pointer-stored values mean the map is only written when a new ngram is first seen (~282K writes for kubernetes) rather than on every trigram occurrence (~169M). This cuts per-trigram map operations from 4 to ~1. 2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate each posting list to 1024 bytes, reducing slice growth events and eliminating map rehashing. 3. Pool postingsBuilder instances via sync.Pool on the Builder, so sequential and parallel shard builds reuse map and slice allocations across shards instead of re-creating them. Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max): Cold path (NewSearchableString): Time: 9.3s → 5.3s (-43%) Allocs: 901K → 677K (-25%) B/op: 1358 → 1536 MB (+13%, from larger pre-alloc) Warm path (pooled reuse across shards): Time: 9.3s → 5.1s (-45%) Allocs: 901K → 23K (-97%) B/op: 1358 → 0.6 MB (-99.96%) WritePostings: ~155ms, unchanged. * index: direct-indexed array for ASCII trigrams, 3.7x faster builds Replace map lookups with a direct-indexed [2M]*postingList array for ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers). Since 99%+ of trigrams in source code are pure ASCII, this eliminates nearly all hash and probe overhead from the hot loop. Non-ASCII trigrams still fall back to the map. Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid utf8.DecodeRune function call overhead on the 95-99% of bytes that are ASCII. In writePostings, collect ngrams from both the array and map into a single sorted slice for writing. The on-disk format is unchanged. Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max): Cold path (NewSearchableString): Before (with map opt): 5.3s, 677K allocs, 1536 MB After: 2.5s, 676K allocs, 1544 MB Speedup: 2.1x (3.7x vs original baseline) Warm path (pooled reuse): Before (with map opt): 5.1s, 23K allocs, 0.6 MB After: 2.3s, 23K allocs, 0.6 MB Speedup: 2.2x (4.0x vs original baseline) WritePostings: ~130ms, unchanged. The non-ASCII map now holds only ~6K entries (vs ~282K before), since the vast majority of trigrams are served by the direct array. * index: fix stale comments after ASCII array introduction Update three comments that still referenced map-only storage after the direct-indexed ASCII array was added: postingList doc, estimateNgrams doc, and reset() doc. * index: address review feedback from keegancsmith - Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder) that returns both content and name builders to the pool and nils the fields, so any subsequent misuse panics obviously. - Drop error-path pool returns in newShardBuilder: setRepository errors are extremely rare (invalid templates or >64 branches), not worth the code complexity. - Thread shardMax through newShardBuilder(shardMax int). Callers without a shard size context (merge, tests, public API) pass 0 for the default (100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool. - Switch sort.Slice to slices.SortFunc in writePostings for type safety and to avoid interface boxing overhead. * index: sparse ASCII index, reduce initialPostingCap to 64 Address review feedback: - Add asciiPopulated []uint32 sparse index so reset() and writePostings iterate only populated slots (~275K) instead of scanning all 2M. Retains postingList allocations for pool reuse via len(pl.data)==0 detection on the hot path. - Reduce initialPostingCap from 1024 to 64. On kubernetes (282K trigrams), median posting list is 10 bytes and 78% are under 64. Drops pre-allocation waste from 244 MB to 11 MB. Cold-path B/op: 1558 MB → 1352 MB (-13%).

3 months ago

Cache docMatchTree results for Meta conditions (#982)

9 months ago

index: decide between tenant and non-tenant shard name in one place (#953) We have two places with duplicated logic around how it decides the layout of shards on disk. This now moves that decision into one place. Additionally we can now unexport index.ShardName. It was only used in one place outside the package, and that was easy to replace with a hardcoded string since it is just a test. Test Plan: Just CI. This has no actual change in functionality, just refactoring.

1 year ago

refactor(all): goimports -w -local github.com/sourcegraph/zoekt (#948)

1 year ago

Move root-level index code to index package (#902) In the repo root, we have a bunch of low level logic around index building and searching. So we end up exposing internal logic through the main public `zoekt` package, for example `zoekt.Merge(...)`. This PR moves it into the `build` package, so all code related to index building lives together. It then renames `build` to `index` to reflect the broader focus on indexing and searching the index.

1 year ago

shard_builder.go

fix/index: preserve skipped-file category through shard paths (#1073) ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path skipped content can be replaced with the not-indexed marker, so doing language first leaves category detection operating on synthetic content instead of the original file and misses the cheaper skip-aware language path. This changes ShardBuilder.Add to determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents. Shard merging also reconstructs documents through ShardBuilder, so this PR carries the category already stored in the source shard into the rebuilt document. That keeps merge metadata-preserving instead of forcing category inference to rediscover information the original indexer already knew.

1 week ago

shard_builder_test.go

fix/index: preserve skipped-file category through shard paths (#1073) ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path skipped content can be replaced with the not-indexed marker, so doing language first leaves category detection operating on synthetic content instead of the original file and misses the cheaper skip-aware language path. This changes ShardBuilder.Add to determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents. Shard merging also reconstructs documents through ShardBuilder, so this PR carries the category already stored in the source shard into the rebuilt document. That keeps merge metadata-preserving instead of forcing category inference to rediscover information the original indexer already knew.

1 week ago

ranking: incorporate file signals into BM25F (#922) This PR reworks the way we incorporate file signals into BM25. Previously, we were applying them as a tie-breaker. But in dogfooding, we found that these rarely impact results, because it's so rare to have a tie in BM25 scores. Now, we take the file signal into account when computing BM25F. The interpretation is that this data lives in a separate 'field' that is half the priority of regular content.

1 year ago

all: run modernize across codebase (#919) The latest release of gopls has a feature called modernize which will update your code where it can to use modern go features/pkgs. https://github.com/golang/tools/releases/tag/gopls%2Fv0.18.0 Generated with: go run golang.org/x/tools/gopls/internal/analysis/modernize/cmd/modernize@latest -fix -test ./... Test Plan: CI

1 year ago

tombstones_test.go

Move root-level index code to index package (#902) In the repo root, we have a bunch of low level logic around index building and searching. So we end up exposing internal logic through the main public `zoekt` package, for example `zoekt.Merge(...)`. This PR moves it into the `build` package, so all code related to index building lives together. It then renames `build` to `index` to reflect the broader focus on indexing and searching the index.

1 year ago

index: 3.7x faster posting list construction via direct-indexed ASCII array (#1020) * index: reduce GC pressure in posting list construction by 1.8x Three targeted changes to newSearchableString, the hot path where ~54% of CPU was spent on runtime memory management (memclr, memmove, madvise, mapassign): 1. Merge postings and lastOffsets maps into a single map[ngram]*postingList. Pointer-stored values mean the map is only written when a new ngram is first seen (~282K writes for kubernetes) rather than on every trigram occurrence (~169M). This cuts per-trigram map operations from 4 to ~1. 2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate each posting list to 1024 bytes, reducing slice growth events and eliminating map rehashing. 3. Pool postingsBuilder instances via sync.Pool on the Builder, so sequential and parallel shard builds reuse map and slice allocations across shards instead of re-creating them. Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max): Cold path (NewSearchableString): Time: 9.3s → 5.3s (-43%) Allocs: 901K → 677K (-25%) B/op: 1358 → 1536 MB (+13%, from larger pre-alloc) Warm path (pooled reuse across shards): Time: 9.3s → 5.1s (-45%) Allocs: 901K → 23K (-97%) B/op: 1358 → 0.6 MB (-99.96%) WritePostings: ~155ms, unchanged. * index: direct-indexed array for ASCII trigrams, 3.7x faster builds Replace map lookups with a direct-indexed [2M]*postingList array for ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers). Since 99%+ of trigrams in source code are pure ASCII, this eliminates nearly all hash and probe overhead from the hot loop. Non-ASCII trigrams still fall back to the map. Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid utf8.DecodeRune function call overhead on the 95-99% of bytes that are ASCII. In writePostings, collect ngrams from both the array and map into a single sorted slice for writing. The on-disk format is unchanged. Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max): Cold path (NewSearchableString): Before (with map opt): 5.3s, 677K allocs, 1536 MB After: 2.5s, 676K allocs, 1544 MB Speedup: 2.1x (3.7x vs original baseline) Warm path (pooled reuse): Before (with map opt): 5.1s, 23K allocs, 0.6 MB After: 2.3s, 23K allocs, 0.6 MB Speedup: 2.2x (4.0x vs original baseline) WritePostings: ~130ms, unchanged. The non-ASCII map now holds only ~6K entries (vs ~282K before), since the vast majority of trigrams are served by the direct array. * index: fix stale comments after ASCII array introduction Update three comments that still referenced map-only storage after the direct-indexed ASCII array was added: postingList doc, estimateNgrams doc, and reset() doc. * index: address review feedback from keegancsmith - Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder) that returns both content and name builders to the pool and nils the fields, so any subsequent misuse panics obviously. - Drop error-path pool returns in newShardBuilder: setRepository errors are extremely rare (invalid templates or >64 branches), not worth the code complexity. - Thread shardMax through newShardBuilder(shardMax int). Callers without a shard size context (merge, tests, public API) pass 0 for the default (100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool. - Switch sort.Slice to slices.SortFunc in writePostings for type safety and to avoid interface boxing overhead. * index: sparse ASCII index, reduce initialPostingCap to 64 Address review feedback: - Add asciiPopulated []uint32 sparse index so reset() and writePostings iterate only populated slots (~275K) instead of scanning all 2M. Retains postingList allocations for pool reuse via len(pl.data)==0 detection on the hot path. - Reduce initialPostingCap from 1024 to 64. On kubernetes (282K trigrams), median posting list is 10 bytes and 78% are under 64. Drops pre-allocation waste from 244 MB to 11 MB. Cold-path B/op: 1558 MB → 1352 MB (-13%).

3 months ago