bloom: add bloom filters over words in contents/filenames. · boltless.me/zoekt@a1cf13f

fork of https://github.com/sourcegraph/zoekt

bloom: add bloom filters over words in contents/filenames.

Bloom filters are probabilistic data structures with a 0% false negative
rate and a configurable false positive rate, which makes them ideal for
rejecting queries that are definitely not satisfiable before moving on to
more expensive evaluation steps.

These bloom filters store case-insensitive word fragments of length 4-8
from file contents and filenames matching the regex [a-zA-Z_]\w{3,7}.
These filters are then used during evaluation to skip shards with no
matches for literals in the input query, just as the trigram index
does. Because the bloom filter stores longer fragments than trigrams,
it has better precision and a lower false positive rate, dropping from
roughly 25% with trigrams alone to under 1% with added bloom filters.

This adds an additional 100KB of mmap-backed data per shard that is used
in the query path, although the blocked bloom construction reduces the
effect of a cold cache: answering a query requires paging in at most
strlen(frag)-3 pages from disk.

A novelty of the bloom filter implementation is that it supports a
shrinking operation, based on bitwise OR and a filter size that has many
divisors, to reach an arbitrary target load factor after the initial
dataset has been added to the filter. This saves some time in
construction, and makes it much easier to evaluate tradeoffs between
load factor and false positive rate.
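
The shrinking idea can be sketched as follows. This is a minimal
illustration under stated assumptions, not the code from bloom.go: the
type foldBloom, its methods, the FNV-1a hash, and the two-probe scheme
are all hypothetical. It relies only on the fact that when the new size
divides the old size, h % newBits == (h % oldBits) % newBits, so OR-ing
the slices of the filter together keeps every added item findable.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

// foldBloom is a hypothetical, simplified bloom filter (not zoekt's type).
type foldBloom struct {
	words []uint64
}

func newFoldBloom(nWords int) *foldBloom {
	return &foldBloom{words: make([]uint64, nWords)}
}

// probes derives two bit positions from one FNV-1a hash, each taken
// modulo the current bit count.
func (b *foldBloom) probes(s string) [2]uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	n := uint64(len(b.words)) * 64
	return [2]uint64{sum % n, bits.RotateLeft64(sum, 32) % n}
}

func (b *foldBloom) add(s string) {
	for _, p := range b.probes(s) {
		b.words[p/64] |= 1 << (p % 64)
	}
}

func (b *foldBloom) mayContain(s string) bool {
	for _, p := range b.probes(s) {
		if b.words[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

// load returns the fraction of bits that are set.
func (b *foldBloom) load() float64 {
	set := 0
	for _, w := range b.words {
		set += bits.OnesCount64(w)
	}
	return float64(set) / float64(len(b.words)*64)
}

// shrink folds the filter to 1/factor of its size with bitwise OR.
func (b *foldBloom) shrink(factor int) {
	if factor <= 0 || len(b.words)%factor != 0 {
		panic("factor must divide the filter size")
	}
	n := len(b.words) / factor
	folded := make([]uint64, n)
	for i, w := range b.words {
		folded[i%n] |= w
	}
	b.words = folded
}

func main() {
	b := newFoldBloom(60) // 60 words: divisible by 2, 3, 4, 5, 6, 10, ...
	for _, s := range []string{"bloom", "filter", "shard", "trigram"} {
		b.add(s)
	}
	before := b.load()
	b.shrink(5)
	fmt.Printf("load %.4f -> %.4f, contains shard: %v\n",
		before, b.load(), b.mayContain("shard"))
}
```

In this sketch a builder would fill an oversized filter once, measure
its load, and then pick the divisor that lands closest to the target
load factor, rather than rebuilding the filter at each candidate size.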

The bloom filter function was carefully chosen to minimize false
positive rate for the given amount of overhead.

Testing indicates that the filters can make some queries perform
~10x less disk I/O and run ~5x faster.

Change-Id: Ia3adf9e46b198036b3493e289d02b6ebb8bd58c2

author: Ryan Hitchman
committer: Ryan Hitchman
date: Sep 29, 2021, 12:50 PM -0600
commit: a1cf13fc7c3b8398b7a85b13e2e06a527bbd2906
parent: 27969cc10346fada1d7145f6969219c894670344

+1039 -4

16 changed files


api.go
bloom.go
bloom_test.go
cmd/zoekt-webserver/main.go
eval.go
index_test.go
indexbuilder.go
indexdata.go
matchtree.go
read.go
testdata/golden/TestReadSearch/repo17_v17.00000.golden
testdata/golden/TestReadSearch/repo_v16.00000.golden
testdata/shards/repo17_v17.00000.zoekt
testdata/shards/repo_v16.00000.zoekt
toc.go
write.go
