doc/design.md at ddbde8b3ad23a2c37b8be13df827bd12446d63b0 · boltless.me/zoekt

boltless.me / zoekt
Fork 0
fork of https://github.com/sourcegraph/zoekt
Fork 0
zoekt / doc / design.md
at ddbde8b3ad23a2c37b8be13df827bd12446d63b0 335 lines 11 kB View raw View rendered
wrap content
Keegan Carruthers-Smith docs: some updates (#952) 1y ago
080acfdf
  1
  2
  3OBJECTIVE
  4=========
  5
  6Provide full text code search for git based corpuses.
  7
  8Goals:
  9
 10* sub-50ms results on large codebases, such as Android (~2G text) or
 11  Chrome
 12
 13* works well on a single standard Linux machine, with stable storage on SSD
 14
 15* search multiple repositories and multiple branches.
 16
 17* provide rich query language, with boolean operators
 18
 19* integrate with Gerrit/Gitiles code review/browsing system
 20
 21
 22SEARCHING AND INDEXING
 23======================
 24
 25
 26Positional trigrams
 27-------------------
 28
 29We build an index of ngrams (n=3), where we store the offset of each
 30ngram's occurrence within a file.  For example, if the corpus is "banana"
 31then we generate the index
 32
 33    "ban": 0
 34    "ana": 1,3
 35    "nan": 2
 36
 37If we are searching for a string (eg. "The quick brown fox"), then we
 38look for two trigrams (eg. "The" and "fox"), and check that they are
 39found at the right distance apart.
 40
 41Regular expressions are handled by extracting normal strings from the regular
 42expressions. For example, to search for
 43
 44    (Path|PathFragment).*=.*/usr/local
 45
 46we look for
 47
 48    (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local")
 49
 50and any documents thus found would be searched for the regular
 51expression.
 52
 53Compared to indexing 3-grams on a per-file basis, as described
 54[here](https://swtch.com/~rsc/regexp/regexp4.html), there are some advantages:
 55
 56* for each substring, we only have to intersect just a couple of posting-lists:
 57  one for the beginning, and one for the end.
 58
 59* Since we touch few posting lists per query, they can be stored on
 60  slower media, such as SSD.
 61
 62* we can select any pair of trigrams from the pattern for which the
 63  number of matches is minimal. For example, we could search for "qui"
 64  rather than "the".
 65
 66There are some downsides compared to trigrams:
 67
 68* The index is large. Empirically, it is about 3x the corpus size, composed of
 69  2x (offsets), and 1x (original content). However, since we have to look at
 70  just a limited number of ngrams, we don't have to keep the index in memory.
 71
 72Compared to [suffix
 73arrays](https://blog.nelhage.com/2015/02/regular-expression-search-with-suffix-arrays/),
 74there are the following advantages:
 75
 76* The index construction is straightforward, and can easily be made
 77  incremental.
 78
 79* Since the posting lists for a trigram can be stored on SSD,
 80  searching with positional trigrams only requires 1.2x corpus size of
 81  RAM.
 82
 83* All the matches are returned in document order. This makes it
 84  straightforward to process compound boolean queries with AND and OR.
 85
 86Downsides compared to suffix array:
 87
 88* there is no way to transform regular expressions into index ranges into
 89  the suffix array.
 90
 91
 92Case sensitivity
 93----------------
 94
 95Code usually is searched without regard for case. In this case, when
 96we are looking for "abc", we look for occurrences of all the different
 97case variants, ie. {"abc", "Abc", "aBc", "ABc", ... }, and then
 98compare the candidate matches without regard for case.
 99
100
101UTF-8
102-----
103
104UTF-8 is the defacto encoding for unicode. Zoekt assumes that files
105are UTF-8 encoded.  Characters have differing widths in UTF-8, so we
106use rune offsets in the trigram index, and convert those back to bytes
107with a lookup table: every 100 runes, we store the rune-index to
108byte-index mapping. For corpuses that are completely ASCII (fairly
109normal for source code), we short-circuit this lookup.
110
111
112Branches
113--------
114
115Each file blob in the index has a bitmask, representing the branches
116in which the content is found, eg:
117
118    branches: [master=1, staging=2, stable=4]
119    file "x.java", branch mask=3
120    file "x.java", branch mask=4
121
122in this case, the index holds two versions of "x.java", the one
123present in "master" and "staging", and the one in the "stable" branch.
124
125With this technique, we can index many similar branches of a
126repository with little space overhead.
127
128
129Index format
130------------
131
132The index is organized in shards, where each shard is a file, laid out
133such that it can be mmap'd efficiently.
134
135Each shard contains data for one code repository. The basic data in an
136index shard are the following
137
138   * file contents
139   * filenames
140   * the content posting lists (varint encoded)
141   * the filename posting lists (varint encoded)
142   * branch masks
143   * metadata (repository name, index format version, etc.)
144
145In practice, the shard size is about 3.5x the corpus size, composed of
146original content, posting lists, and other metadata.
147
148The format uses uint32 for all offsets, so the total size of a shard
149should be below 4G. Given the size of the posting data, this caps
150content size per shard at 1G.
151
152Currently, within a shard, a single goroutine searches all documents,
153so the shard size determines the amount of parallelism, and large
154repositories should be split across multiple shards to achieve good
155performance.
156
157The metadata section contains a version number (which by convention is
158also part of the file name of the shard). This provides a smooth
159upgrade path across format versions: generate shards in the new
160format, kill old search service, start new search service, delete old
161shards.
162
163
164Ranking
165-------
166
167In absense of advanced signals (e.g. pagerank on symbol references),
168ranking options are limited: the following signals could be used for
169ranking
170
171   * number of atoms matched
172   * closeness to matches for other atoms
173   * quality of match: does match boundary coincide with a word boundary?
174   * file latest update time
175   * filename lengh
176   * tokenizer ranking: does a match fall comment or string literal?
177   * symbol ranking: it the match a symbol definition?
178
179For the latter, it is necessary to find symbol definitions and other
180sections within files on indexing. Several (imperfect) programs to do
181this already exist, eg. `ctags`.
182
183Zoekt also supports an alternative BM25-based scoring algorithm that can be
184enabled with `UseBM25Scoring`. When enabled, each match in a file is treated
185as a term, and an approximation to BM25 is computed. This is useful for
186multi-term queries, better handling of term frequency, and appropriate
187document length normalization.
188
189
190Query language
191--------------
192
193Queries are stored as expression trees, using the following data
194structure:
195
196    Query:
197        Atom
198        | AND QueryList
199        | OR QueryList
200        | NOT Query
201        ;
202
203    Atom:
204        ConstQuery
205        | SubStringQuery
206        | RegexpQuery
207        | RepoQuery
208        | BranchQuery
209        ;
210
211Both SubStringQuery and RegexpQuery can apply to either file or
212contents, and can optionally be case-insensitive.
213
214ConstQuery (match everything, or match nothing) is a useful construct
215for partial evaluation of a query: for each index shard through which
216we search, we partially evaluate the query, eg. when the query is
217
218    and[substr:"needle" repo:"zoekt"]
219
220then we can rewrite the query to FALSE if we are looking at a shard
221for repository "bazel", skipping the entire shard.
222
223Each query must have at least one positive atom. Negations can only
224serve to prune results generated by positive atoms.
225
226
227Query parsing
228-------------
229
230Strings in the input language are considered regular expressions
231but literal regular expressions are simplified to Substring queries,
232
233    a.*b => regexp:"a.*b"
234    a\.b => substring:"a.b"
235
236leading modifiers select different types of atoms, eg.
237
238    file:java => Substring_file:"java"
239    branch:master => Repo:"master"
240
241parentheses inside a string (possibly with escaped spaces) are
242interpreted as regular expressions, otherwise they are used for grouping
243
244    (abc def) => and[substring:"abc" substring:"def"]
245    (abc\ def) => regexp:"(abc def)"
246
247there is an implicit "AND" between elements of a parenthesized list.
248There is an "OR" operator, which has lower priority than the implicit
249"AND":
250
251    ppp qqq or rrr sss => or[and[substring:"ppp" substring:"qqq"] and[substring:"rrr" substring:"sss"]]
252
253
254GERRIT/GITILES INTEGRATION
255==========================
256
257Gerrit is a popular system for code review on open source
258projects. Its sister project Gitiles provides a browsing experience.
259
260Any code search integration with Gerrit should be made available in
261Gitiles.  Gerrit/Gitiles has a complex ACL system, so a codesearch
262solution for Gerrit/Gitiles should respect these ACLs.
263
264Since Gitiles knows the identity of the logged-in user, it can
265construct search queries that respect ACLs, and even filter results
266afterwards if necessary. In such a setup, only Gitiles is allowed to
267talk to the search service, so it should be protected from general
268access, e.g. by requiring authentication.
269
270A codesearch implementation for Gitiles would change Gitiles to show a
271search box on pages relating to a repository. When searching, Gitiles
272would also render the search results. The process is as follows:
273
274   * On receiving a query, Gitiles finds the list of branches visible to the user
275   * Gitiles sends the raw query, along with branches and repository to the search service
276   * The search service parses the query, and embeds it as follows
277
278    (AND original-query repo:REPO (OR "branch:visible-1" "branch:visible-2" .. ))
279
280   * The search service returns the search results, leaving it to
281     gitiles to render them. Gitiles can apply any further filtering
282     as necessary.
283
284
285SERVICE MANAGEMENT
286==================
287
288The above details how indexing and searching works. A fully fledged
289system also crawls repositories and (re)indexes them. Since the system
290is designed to run on a single machine, we provide a service
291management tool, with the following responsibilities:
292
293   * Poll git hosting sites (eg. github.com, googlesource.com), to fetch new updates
294   * Reindex any changed repositories
295   * Run the webserver; and restart if it goes down for any reason
296   * Delete old webserver logs
297
298
299Security
300--------
301
302This section assumes that 'zoekt' is used as a public facing
303webserver, indexing publicly available data, serving on HTTPS without
304authentication.
305
306Since the UI is unauthenticated, there are no authentication secrets to steal.
307
308Since the code is public, there is no sensitive code to steal.
309
310This leaves us with the following senstive data:
311
312   * Credentials for accesssing git repostitories (eg. github access token)
313   * TLS server certificates
314   * Query logs
315
316The system handles the following untrusted data:
317
318   * code in git repositories
319   * search queries
320
321Since 'zoekt' itself is written in Go, it does not have memory
322security problems: at worst, a bug in the query parser would lead to a
323crash.
324
325The code to index is handled by `ctags` for symbol detection. The
326security risk this poses is mitigated by using a seccomp based
327sandboxing.
328
329
330Privacy
331-------
332
333Webserver logs can contain privacy sensitive data (such as IP
334addresses and search queries). For this reason, the service management
335tool deletes them after a configurable period of time.
Configure Feed

Configure Feed