ngramoffset: add a dense asciiNgramOffset mapper and a combiner. (#90)

Most ngrams are ASCII text and have small amounts of data. The dense
mapper exploits this by packing three 7-bit runes and an 11-bit length
into a single 32-bit entry, and stores periodic offsets so that
simpleSection{offset, length} can be reconstructed with a fixed maximum
amount of extra computation.
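
A minimal sketch of the packing idea, not the code in this change. The
names entry, table, chunk, pack, build and section are hypothetical, as
are the bit layout (runes in the high 21 bits, length in the low 11),
the checkpoint interval of 32, and the assumption that sections are
laid out back to back in entry order. Looking up entry i starts from
the nearest stored offset and adds at most chunk-1 lengths, which is
the "fixed maximum amount of extra computation":

```go
package main

// entry packs three 7-bit ASCII runes and an 11-bit length into 32 bits.
// Assumed layout for this sketch: [rune a:7][rune b:7][rune c:7][length:11].
type entry uint32

const (
	lengthBits = 11
	maxLength  = 1<<lengthBits - 1 // 2047, largest length that fits
	chunk      = 32                // hypothetical checkpoint interval
)

// pack returns the packed entry, or ok=false when any rune is not 7-bit
// ASCII or the length does not fit in 11 bits. Callers would route such
// ngrams to a fallback (non-dense) mapper.
func pack(a, b, c rune, length uint32) (e entry, ok bool) {
	if uint32(a|b|c) > 0x7f || length > maxLength {
		return 0, false
	}
	ngram := uint32(a)<<14 | uint32(b)<<7 | uint32(c)
	return entry(ngram<<lengthBits | length), true
}

// ngram returns the 21-bit packed trigram.
func (e entry) ngram() uint32 { return uint32(e) >> lengthBits }

// length returns the 11-bit section length.
func (e entry) length() uint32 { return uint32(e) & maxLength }

// table stores packed entries plus one absolute offset per chunk entries.
type table struct {
	entries     []entry
	checkpoints []uint32 // checkpoints[k] is the offset of entries[k*chunk]
}

// build records a checkpoint every chunk entries, assuming the sections
// are contiguous and start at base.
func build(entries []entry, base uint32) table {
	t := table{entries: entries}
	off := base
	for i, e := range entries {
		if i%chunk == 0 {
			t.checkpoints = append(t.checkpoints, off)
		}
		off += e.length()
	}
	return t
}

// section reconstructs (offset, length) for entry i by starting at the
// preceding checkpoint and summing at most chunk-1 earlier lengths.
func (t table) section(i int) (offset, length uint32) {
	offset = t.checkpoints[i/chunk]
	for _, e := range t.entries[i-i%chunk : i] {
		offset += e.length()
	}
	return offset, t.entries[i].length()
}
```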

This drops ngramOffset RAM from 3.6GB (69% of total) to 2.3GB (59%),
with asciiNgramOffsets taking 1.7GB (44%) and arrayNgramOffsets taking
0.6GB (15%).

Unfortunately, this increases CPU usage when loading the index:
readIndexData increases from 13s to 18s, with the CPU time of the main
readNgrams call doubling from 5s to 10s.

This also adds a more comprehensive test to exercise the various
boundary conditions of the ASCII/Unicode splitting logic.