fork of https://github.com/sourcegraph/zoekt
1
2
3OBJECTIVE
4=========
5
6Provide full text code search for git based corpuses.
7
8Goals:
9
10* sub-50ms results on large codebases, such as Android (~2G text) or
11 Chrome
12
13* works well on a single standard Linux machine, with stable storage on SSD
14
15* search multiple repositories and multiple branches.
16
17* provide rich query language, with boolean operators
18
19* integrate with Gerrit/Gitiles code review/browsing system
20
21
22SEARCHING AND INDEXING
23======================
24
25
26Positional trigrams
27-------------------
28
29We build an index of ngrams (n=3), where we store the offset of each
30ngram's occurrence within a file. For example, if the corpus is "banana"
31then we generate the index
32
33 "ban": 0
34 "ana": 1,3
35 "nan": 2
36
37If we are searching for a string (eg. "The quick brown fox"), then we
38look for two trigrams (eg. "The" and "fox"), and check that they are
39found at the right distance apart.
40
41Regular expressions are handled by extracting normal strings from the regular
42expressions. For example, to search for
43
44 (Path|PathFragment).*=.*/usr/local
45
46we look for
47
48 (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local")
49
50and any documents thus found would be searched for the regular
51expression.
52
53Compared to indexing 3-grams on a per-file basis, as described
54[here](https://swtch.com/~rsc/regexp/regexp4.html), there are some advantages:
55
56* for each substring, we only have to intersect just a couple of posting-lists:
57 one for the beginning, and one for the end.
58
59* Since we touch few posting lists per query, they can be stored on
60 slower media, such as SSD.
61
62* we can select any pair of trigrams from the pattern for which the
63 number of matches is minimal. For example, we could search for "qui"
64 rather than "the".
65
66There are some downsides compared to trigrams:
67
68* The index is large. Empirically, it is about 3x the corpus size, composed of
69 2x (offsets), and 1x (original content). However, since we have to look at
70 just a limited number of ngrams, we don't have to keep the index in memory.
71
72Compared to [suffix
73arrays](https://blog.nelhage.com/2015/02/regular-expression-search-with-suffix-arrays/),
74there are the following advantages:
75
76* The index construction is straightforward, and can easily be made
77 incremental.
78
79* Since the posting lists for a trigram can be stored on SSD,
80 searching with positional trigrams only requires 1.2x corpus size of
81 RAM.
82
83* All the matches are returned in document order. This makes it
84 straightforward to process compound boolean queries with AND and OR.
85
86Downsides compared to suffix array:
87
88* there is no way to transform regular expressions into index ranges into
89 the suffix array.
90
91
92Case sensitivity
93----------------
94
95Code usually is searched without regard for case. In this case, when
96we are looking for "abc", we look for occurrences of all the different
97case variants, ie. {"abc", "Abc", "aBc", "ABc", ... }, and then
98compare the candidate matches without regard for case.
99
100
101UTF-8
102-----
103
104UTF-8 is the defacto encoding for unicode. Zoekt assumes that files
105are UTF-8 encoded. Characters have differing widths in UTF-8, so we
106use rune offsets in the trigram index, and convert those back to bytes
107with a lookup table: every 100 runes, we store the rune-index to
108byte-index mapping. For corpuses that are completely ASCII (fairly
109normal for source code), we short-circuit this lookup.
110
111
112Branches
113--------
114
115Each file blob in the index has a bitmask, representing the branches
116in which the content is found, eg:
117
118 branches: [master=1, staging=2, stable=4]
119 file "x.java", branch mask=3
120 file "x.java", branch mask=4
121
122in this case, the index holds two versions of "x.java", the one
123present in "master" and "staging", and the one in the "stable" branch.
124
125With this technique, we can index many similar branches of a
126repository with little space overhead.
127
128
129Index format
130------------
131
132The index is organized in shards, where each shard is a file, laid out
133such that it can be mmap'd efficiently.
134
135Each shard contains data for one code repository. The basic data in an
136index shard are the following
137
138 * file contents
139 * filenames
140 * the content posting lists (varint encoded)
141 * the filename posting lists (varint encoded)
142 * branch masks
143 * metadata (repository name, index format version, etc.)
144
145In practice, the shard size is about 3x the corpus (size).
146
147The format uses uint32 for all offsets, so the total size of a shard
148should be below 4G. Given the size of the posting data, this caps
149content size per shard at 1G.
150
151Currently, within a shard, a single goroutine searches all documents,
152so the shard size determines the amount of parallelism, and large
153repositories should be split across multiple shards to achieve good
154performance.
155
156The metadata section contains a version number (which by convention is
157also part of the file name of the shard). This provides a smooth
158upgrade path across format versions: generate shards in the new
159format, kill old search service, start new search service, delete old
160shards.
161
162
163Ranking
164-------
165
166In absense of advanced signals (e.g. pagerank on symbol references),
167ranking options are limited: the following signals could be used for
168ranking
169
170 * number of atoms matched
171 * closeness to matches for other atoms
172 * quality of match: does match boundary coincide with a word boundary?
173 * file latest update time
174 * filename lengh
175 * tokenizer ranking: does a match fall comment or string literal?
176 * symbol ranking: it the match a symbol definition?
177
178For the latter, it is necessary to find symbol definitions and other
179sections within files on indexing. Several (imperfect) programs to do
180this already exist, eg. `ctags`.
181
182
183Query language
184--------------
185
186Queries are stored as expression trees, using the following data
187structure:
188
189 Query:
190 Atom
191 | AND QueryList
192 | OR QueryList
193 | NOT Query
194 ;
195
196 Atom:
197 ConstQuery
198 | SubStringQuery
199 | RegexpQuery
200 | RepoQuery
201 | BranchQuery
202 ;
203
204Both SubStringQuery and RegexpQuery can apply to either file or
205contents, and can optionally be case-insensitive.
206
207ConstQuery (match everything, or match nothing) is a useful construct
208for partial evaluation of a query: for each index shard through which
209we search, we partially evaluate the query, eg. when the query is
210
211 and[substr:"needle" repo:"zoekt"]
212
213then we can rewrite the query to FALSE if we are looking at a shard
214for repository "bazel", skipping the entire shard.
215
216Each query must have at least one positive atom. Negations can only
217serve to prune results generated by positive atoms.
218
219
220Query parsing
221-------------
222
223Strings in the input language are considered regular expressions
224but literal regular expressions are simplified to Substring queries,
225
226 a.*b => regexp:"a.*b"
227 a\.b => substring:"a.b"
228
229leading modifiers select different types of atoms, eg.
230
231 file:java => Substring_file:"java"
232 branch:master => Repo:"master"
233
234parentheses inside a string (possibly with escaped spaces) are
235interpreted as regular expressions, otherwise they are used for grouping
236
237 (abc def) => and[substring:"abc" substring:"def"]
238 (abc\ def) => regexp:"(abc def)"
239
240there is an implicit "AND" between elements of a parenthesized list.
241There is an "OR" operator, which has lower priority than the implicit
242"AND":
243
244 ppp qqq or rrr sss => or[and[substring:"ppp" substring:"qqq"] and[substring:"rrr" substring:"sss"]]
245
246
247GERRIT/GITILES INTEGRATION
248==========================
249
250Gerrit is a popular system for code review on open source
251projects. Its sister project Gitiles provides a browsing experience.
252
253Any code search integration with Gerrit should be made available in
254Gitiles. Gerrit/Gitiles has a complex ACL system, so a codesearch
255solution for Gerrit/Gitiles should respect these ACLs.
256
257Since Gitiles knows the identity of the logged-in user, it can
258construct search queries that respect ACLs, and even filter results
259afterwards if necessary. In such a setup, only Gitiles is allowed to
260talk to the search service, so it should be protected from general
261access, e.g. by requiring authentication.
262
263A codesearch implementation for Gitiles would change Gitiles to show a
264search box on pages relating to a repository. When searching, Gitiles
265would also render the search results. The process is as follows:
266
267 * On receiving a query, Gitiles finds the list of branches visible to the user
268 * Gitiles sends the raw query, along with branches and repository to the search service
269 * The search service parses the query, and embeds it as follows
270
271 (AND original-query repo:REPO (OR "branch:visible-1" "branch:visible-2" .. ))
272
273 * The search service returns the search results, leaving it to
274 gitiles to render them. Gitiles can apply any further filtering
275 as necessary.
276
277
278SERVICE MANAGEMENT
279==================
280
281The above details how indexing and searching works. A fully fledged
282system also crawls repositories and (re)indexes them. Since the system
283is designed to run on a single machine, we provide a service
284management tool, with the following responsibilities:
285
286 * Poll git hosting sites (eg. github.com, googlesource.com), to fetch new updates
287 * Reindex any changed repositories
288 * Run the webserver; and restart if it goes down for any reason
289 * Delete old webserver logs
290
291
292Security
293--------
294
295This section assumes that 'zoekt' is used as a public facing
296webserver, indexing publicly available data, serving on HTTPS without
297authentication.
298
299Since the UI is unauthenticated, there are no authentication secrets to steal.
300
301Since the code is public, there is no sensitive code to steal.
302
303This leaves us with the following senstive data:
304
305 * Credentials for accesssing git repostitories (eg. github access token)
306 * TLS server certificates
307 * Query logs
308
309The system handles the following untrusted data:
310
311 * code in git repositories
312 * search queries
313
314Since 'zoekt' itself is written in Go, it does not have memory
315security problems: at worst, a bug in the query parser would lead to a
316crash.
317
318The code to index is handled by `ctags` for symbol detection. The
319security risk this poses is mitigated by using a seccomp based
320sandboxing.
321
322
323Privacy
324-------
325
326Webserver logs can contain privacy sensitive data (such as IP
327addresses and search queries). For this reason, the service management
328tool deletes them after a configurable period of time.