Indexing: respect indexing buffer limit (#686) · boltless.me/zoekt@2355607

fork of https://github.com/sourcegraph/zoekt

Indexing: respect indexing buffer limit (#686)

When indexing documents, we buffer up documents until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffered the full document, content included. So if
there is a large number of binary files, the buffered data could easily blow past
the 100MB limit and cause memory issues.
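
The gap between what is counted and what is actually held can be sketched as follows. This is a minimal, scaled-down illustration, not the builder's real code: the `doc` type, the `simulate` helper, the file names, and the 1 KiB content size are all assumptions made for the example. Only the accounting rule (name plus skip reason for skipped documents) is taken from the description above.

```go
package main

import "fmt"

// doc is a hypothetical stand-in for a buffered document; only the three
// fields mentioned in the commit message are modelled.
type doc struct {
	Name       string
	Content    []byte
	SkipReason string
}

// simulate buffers n skipped documents of contentSize bytes each using the
// pre-change accounting. It returns the bytes counted towards the shard
// limit and the content bytes still referenced by the buffer.
func simulate(n, contentSize int) (counted, retained int) {
	var todo []*doc
	for i := 0; i < n; i++ {
		d := &doc{
			Name:       fmt.Sprintf("bin%03d", i),
			Content:    make([]byte, contentSize),
			SkipReason: "binary file",
		}
		// Skipped documents contribute only their name and skip reason.
		counted += len(d.Name) + len(d.SkipReason)
		// The content is still reachable through the buffered document.
		todo = append(todo, d)
	}
	for _, d := range todo {
		retained += len(d.Content)
	}
	return counted, retained
}

func main() {
	counted, retained := simulate(200, 1<<10)
	fmt.Printf("counted towards limit: %d bytes, retained content: %d bytes\n", counted, retained)
}
```

In this sketch the counted size grows by a few bytes per skipped file while the retained content grows by the full file size, so a flush threshold based on the counted size alone is reached far too late.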

This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
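
One way to make the invariant explicit is a small check over the buffered documents. The sketch below is illustrative only: `Document`, `dropSkippedContent`, `checkInvariant`, and `errBothSet` are hypothetical names, not part of zoekt's API, and the sample documents are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// Document is a hypothetical stand-in for the builder's document type.
type Document struct {
	Name       string
	Content    []byte
	SkipReason string
}

var errBothSet = errors.New("document has both SkipReason and Content")

// dropSkippedContent mirrors the rule from this change: once a skip reason
// is set, the content is cleared so it can be garbage collected.
func dropSkippedContent(d *Document) {
	if d.SkipReason != "" {
		d.Content = nil
	}
}

// checkInvariant returns an error for the first buffered document that
// carries both a skip reason and content.
func checkInvariant(todo []*Document) error {
	for _, d := range todo {
		if d.SkipReason != "" && d.Content != nil {
			return fmt.Errorf("%s: %w", d.Name, errBothSet)
		}
	}
	return nil
}

func main() {
	todo := []*Document{
		{Name: "a.go", Content: []byte("package a")},
		{Name: "b.bin", Content: []byte{0, 1, 2}, SkipReason: "binary file"},
	}
	fmt.Println("before:", checkInvariant(todo))
	for _, d := range todo {
		dropSkippedContent(d)
	}
	fmt.Println("after:", checkInvariant(todo))
}
```

A document with neither field set (for example an empty file) passes this check, which matches the "not both" wording of the invariant.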

author: Julie Tibshirani
committer: GitHub
date: Nov 10, 2023, 8:18 AM -0800
commit: 2355607d5ff884be315631cfb1f2ad27bacf10fc
parent: db067d1294f3ac13d77c19c8259baa3a8c6c1b6a

2 changed files



build/builder.go

@@ -642,6 +642,9 @@
 		b.size += len(doc.Name) + len(doc.Content)
 	} else {
 		b.size += len(doc.Name) + len(doc.SkipReason)
+		// Drop the content if we are skipping the document. Skipped content is not counted towards the
+		// shard size limit, so otherwise we might buffer too much data in memory before flushing.
+		doc.Content = nil
 	}
 
 	if b.size > b.opts.ShardMax {

build/builder_test.go

@@ -244,6 +244,9 @@
 	if len(b.todo) != 1 || b.todo[0].SkipReason == "" {
 		t.Fatalf("document should have been skipped")
 	}
+	if b.todo[0].Content != nil {
+		t.Fatalf("document content should be empty")
+	}
 	if b.size >= 100 {
 		t.Fatalf("content of skipped documents should not count towards shard size thresold")
 	}
