bloom: add bloom filters over words in contents/filenames.
Bloom filters are probabilistic data structures with a 0% false negative
rate and a configurable false positive rate, which makes them ideal for
rejecting queries that are definitely not satisfiable before moving on
to more expensive evaluation steps.
These bloom filters store case-insensitive word fragments of 4-8
characters from file contents and filenames, matched by the regex
[a-zA-Z_]\w{3,7}.
These filters are then used during evaluation to skip shards with no
matches to literals in the input query, just like the trigram index
does. Because the bloom filter stores longer fragments than trigrams,
it has better precision and a lower false positive rate, dropping from
roughly 25% with trigrams alone to under 1% with added bloom filters.
This adds 100KB of per-shard mmap-backed data used in the query paths,
although the effects of a cold cache are reduced by the blocked bloom
construction, which requires paging in at most strlen(frag)-3 pages
from disk to answer the query.
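
To show why a blocked construction limits page faults, here is a minimal
sketch in which every probe for a key lands in a single 4 KiB block, so
one key lookup touches one page. The block size, probe count, FNV hash,
and all names are assumptions for illustration, not the layout used by
this change.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	blockBytes = 4096           // assumed: one block per 4 KiB page
	blockBits  = blockBytes * 8 // bits available inside one block
	probes     = 4              // assumed number of bits set per key
)

// blockedBloom is a hypothetical blocked bloom filter.
type blockedBloom struct {
	bits      []byte
	numBlocks uint64
}

func newBlockedBloom(numBlocks int) *blockedBloom {
	return &blockedBloom{
		bits:      make([]byte, numBlocks*blockBytes),
		numBlocks: uint64(numBlocks),
	}
}

// locate picks one block for key and the bit offsets probed inside it.
func (b *blockedBloom) locate(key string) (block uint64, offsets [probes]uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	block = (sum >> 32) % b.numBlocks
	// Double hashing within the block; the step is forced odd.
	h1, h2 := sum&0xffff, ((sum>>16)&0xffff)|1
	for i := range offsets {
		offsets[i] = (h1 + uint64(i)*h2) % blockBits
	}
	return block, offsets
}

// Add sets all probe bits for key, all within the same block.
func (b *blockedBloom) Add(key string) {
	block, offsets := b.locate(key)
	base := block * blockBytes
	for _, off := range offsets {
		b.bits[base+off/8] |= 1 << (off % 8)
	}
}

// MayContain reads only the bytes of one block, i.e. one page.
func (b *blockedBloom) MayContain(key string) bool {
	block, offsets := b.locate(key)
	base := block * blockBytes
	for _, off := range offsets {
		if b.bits[base+off/8]&(1<<(off%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBlockedBloom(25) // 25 blocks of 4 KiB is about 100KB
	f.Add("bloom")
	fmt.Println(f.MayContain("bloom")) // prints true
}
```

In a layout like this, the number of pages read per query is bounded by
the number of keys probed rather than by the number of bits per key.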
A novelty of the bloom filter implementation is that it supports a
shrinking operation based on bitwise OR and a filter size divisible by
many factors, which allows reaching an arbitrary target load factor
after the initial dataset has been added to the filter. This saves some
time in construction, and makes it much easier to evaluate tradeoffs
between load factor and false positive rate.
The bloom filter function was carefully chosen to minimize false
positive rate for the given amount of overhead.
Testing indicates that the filters can make some queries perform
~10x less disk I/O and run ~5x faster.
Change-Id: Ia3adf9e46b198036b3493e289d02b6ebb8bd58c2