fork of https://github.com/sourcegraph/zoekt
0

Configure Feed

Select the types of activity you want to include in your feed.

1 2 3OBJECTIVE 4========= 5 6Provide full text code search for git based corpuses. 7 8Goals: 9 10* sub-50ms results on large codebases, such as Android (~2G text) or 11 Chrome 12 13* works well on a single standard Linux machine, with stable storage on SSD 14 15* search multiple repositories and multiple branches. 16 17* provide rich query language, with boolean operators 18 19* integrate with Gerrit/Gitiles code review/browsing system 20 21 22SEARCHING AND INDEXING 23====================== 24 25 26Positional trigrams 27------------------- 28 29We build an index of ngrams (n=3), where we store the offset of each 30ngram's occurrence within a file. For example, if the corpus is "banana" 31then we generate the index 32 33 "ban": 0 34 "ana": 1,3 35 "nan": 2 36 37If we are searching for a string (eg. "The quick brown fox"), then we 38look for two trigrams (eg. "The" and "fox"), and check that they are 39found at the right distance apart. 40 41Regular expressions are handled by extracting normal strings from the regular 42expressions. For example, to search for 43 44 (Path|PathFragment).*=.*/usr/local 45 46we look for 47 48 (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local") 49 50and any documents thus found would be searched for the regular 51expression. 52 53Compared to indexing 3-grams on a per-file basis, as described 54[here](https://swtch.com/~rsc/regexp/regexp4.html), there are some advantages: 55 56* for each substring, we only have to intersect just a couple of posting-lists: 57 one for the beginning, and one for the end. 58 59* Since we touch few posting lists per query, they can be stored on 60 slower media, such as SSD. 61 62* we can select any pair of trigrams from the pattern for which the 63 number of matches is minimal. For example, we could search for "qui" 64 rather than "the". 65 66There are some downsides compared to trigrams: 67 68* The index is large. Empirically, it is about 3x the corpus size, composed of 69 2x (offsets), and 1x (original content). However, since we have to look at 70 just a limited number of ngrams, we don't have to keep the index in memory. 71 72Compared to [suffix 73arrays](https://blog.nelhage.com/2015/02/regular-expression-search-with-suffix-arrays/), 74there are the following advantages: 75 76* The index construction is straightforward, and can easily be made 77 incremental. 78 79* Since the posting lists for a trigram can be stored on SSD, 80 searching with positional trigrams only requires 1.2x corpus size of 81 RAM. 82 83* All the matches are returned in document order. This makes it 84 straightforward to process compound boolean queries with AND and OR. 85 86Downsides compared to suffix array: 87 88* there is no way to transform regular expressions into index ranges into 89 the suffix array. 90 91 92Case sensitivity 93---------------- 94 95Code usually is searched without regard for case. In this case, when 96we are looking for "abc", we look for occurrences of all the different 97case variants, ie. {"abc", "Abc", "aBc", "ABc", ... }, and then 98compare the candidate matches without regard for case. 99 100 101UTF-8 102----- 103 104UTF-8 is the defacto encoding for unicode. Zoekt assumes that files 105are UTF-8 encoded. Characters have differing widths in UTF-8, so we 106use rune offsets in the trigram index, and convert those back to bytes 107with a lookup table: every 100 runes, we store the rune-index to 108byte-index mapping. For corpuses that are completely ASCII (fairly 109normal for source code), we short-circuit this lookup. 110 111 112Branches 113-------- 114 115Each file blob in the index has a bitmask, representing the branches 116in which the content is found, eg: 117 118 branches: [master=1, staging=2, stable=4] 119 file "x.java", branch mask=3 120 file "x.java", branch mask=4 121 122in this case, the index holds two versions of "x.java", the one 123present in "master" and "staging", and the one in the "stable" branch. 124 125With this technique, we can index many similar branches of a 126repository with little space overhead. 127 128 129Index format 130------------ 131 132The index is organized in shards, where each shard is a file, laid out 133such that it can be mmap'd efficiently. 134 135Each shard contains data for one code repository. The basic data in an 136index shard are the following 137 138 * file contents 139 * filenames 140 * the content posting lists (varint encoded) 141 * the filename posting lists (varint encoded) 142 * branch masks 143 * metadata (repository name, index format version, etc.) 144 145In practice, the shard size is about 3x the corpus (size). 146 147The format uses uint32 for all offsets, so the total size of a shard 148should be below 4G. Given the size of the posting data, this caps 149content size per shard at 1G. 150 151Currently, within a shard, a single goroutine searches all documents, 152so the shard size determines the amount of parallelism, and large 153repositories should be split across multiple shards to achieve good 154performance. 155 156The metadata section contains a version number (which by convention is 157also part of the file name of the shard). This provides a smooth 158upgrade path across format versions: generate shards in the new 159format, kill old search service, start new search service, delete old 160shards. 161 162 163Ranking 164------- 165 166In absense of advanced signals (e.g. pagerank on symbol references), 167ranking options are limited: the following signals could be used for 168ranking 169 170 * number of atoms matched 171 * closeness to matches for other atoms 172 * quality of match: does match boundary coincide with a word boundary? 173 * file latest update time 174 * filename lengh 175 * tokenizer ranking: does a match fall comment or string literal? 176 * symbol ranking: it the match a symbol definition? 177 178For the latter, it is necessary to find symbol definitions and other 179sections within files on indexing. Several (imperfect) programs to do 180this already exist, eg. `ctags`. 181 182 183Query language 184-------------- 185 186Queries are stored as expression trees, using the following data 187structure: 188 189 Query: 190 Atom 191 | AND QueryList 192 | OR QueryList 193 | NOT Query 194 ; 195 196 Atom: 197 ConstQuery 198 | SubStringQuery 199 | RegexpQuery 200 | RepoQuery 201 | BranchQuery 202 ; 203 204Both SubStringQuery and RegexpQuery can apply to either file or 205contents, and can optionally be case-insensitive. 206 207ConstQuery (match everything, or match nothing) is a useful construct 208for partial evaluation of a query: for each index shard through which 209we search, we partially evaluate the query, eg. when the query is 210 211 and[substr:"needle" repo:"zoekt"] 212 213then we can rewrite the query to FALSE if we are looking at a shard 214for repository "bazel", skipping the entire shard. 215 216Each query must have at least one positive atom. Negations can only 217serve to prune results generated by positive atoms. 218 219 220Query parsing 221------------- 222 223Strings in the input language are considered regular expressions 224but literal regular expressions are simplified to Substring queries, 225 226 a.*b => regexp:"a.*b" 227 a\.b => substring:"a.b" 228 229leading modifiers select different types of atoms, eg. 230 231 file:java => Substring_file:"java" 232 branch:master => Repo:"master" 233 234parentheses inside a string (possibly with escaped spaces) are 235interpreted as regular expressions, otherwise they are used for grouping 236 237 (abc def) => and[substring:"abc" substring:"def"] 238 (abc\ def) => regexp:"(abc def)" 239 240there is an implicit "AND" between elements of a parenthesized list. 241There is an "OR" operator, which has lower priority than the implicit 242"AND": 243 244 ppp qqq or rrr sss => or[and[substring:"ppp" substring:"qqq"] and[substring:"rrr" substring:"sss"]] 245 246 247GERRIT/GITILES INTEGRATION 248========================== 249 250Gerrit is a popular system for code review on open source 251projects. Its sister project Gitiles provides a browsing experience. 252 253Any code search integration with Gerrit should be made available in 254Gitiles. Gerrit/Gitiles has a complex ACL system, so a codesearch 255solution for Gerrit/Gitiles should respect these ACLs. 256 257Since Gitiles knows the identity of the logged-in user, it can 258construct search queries that respect ACLs, and even filter results 259afterwards if necessary. In such a setup, only Gitiles is allowed to 260talk to the search service, so it should be protected from general 261access, e.g. by requiring authentication. 262 263A codesearch implementation for Gitiles would change Gitiles to show a 264search box on pages relating to a repository. When searching, Gitiles 265would also render the search results. The process is as follows: 266 267 * On receiving a query, Gitiles finds the list of branches visible to the user 268 * Gitiles sends the raw query, along with branches and repository to the search service 269 * The search service parses the query, and embeds it as follows 270 271 (AND original-query repo:REPO (OR "branch:visible-1" "branch:visible-2" .. )) 272 273 * The search service returns the search results, leaving it to 274 gitiles to render them. Gitiles can apply any further filtering 275 as necessary. 276 277 278SERVICE MANAGEMENT 279================== 280 281The above details how indexing and searching works. A fully fledged 282system also crawls repositories and (re)indexes them. Since the system 283is designed to run on a single machine, we provide a service 284management tool, with the following responsibilities: 285 286 * Poll git hosting sites (eg. github.com, googlesource.com), to fetch new updates 287 * Reindex any changed repositories 288 * Run the webserver; and restart if it goes down for any reason 289 * Delete old webserver logs 290 291 292Security 293-------- 294 295This section assumes that 'zoekt' is used as a public facing 296webserver, indexing publicly available data, serving on HTTPS without 297authentication. 298 299Since the UI is unauthenticated, there are no authentication secrets to steal. 300 301Since the code is public, there is no sensitive code to steal. 302 303This leaves us with the following senstive data: 304 305 * Credentials for accesssing git repostitories (eg. github access token) 306 * TLS server certificates 307 * Query logs 308 309The system handles the following untrusted data: 310 311 * code in git repositories 312 * search queries 313 314Since 'zoekt' itself is written in Go, it does not have memory 315security problems: at worst, a bug in the query parser would lead to a 316crash. 317 318The code to index is handled by `ctags` for symbol detection. The 319security risk this poses is mitigated by using a seccomp based 320sandboxing. 321 322 323Privacy 324------- 325 326Webserver logs can contain privacy sensitive data (such as IP 327addresses and search queries). For this reason, the service management 328tool deletes them after a configurable period of time.