fork of https://github.com/sourcegraph/zoekt
0

Configure Feed

Select the types of activity you want to include in your feed.

1 2 3OBJECTIVE 4========= 5 6Provide full text code search for git based corpuses. 7 8Goals: 9 10* sub-50ms results on large codebases, such as Android (~2G text) or 11 Chrome 12 13* works well on a single standard Linux machine, with stable storage on SSD 14 15* search multiple repositories and multiple branches. 16 17* provide rich query language, with boolean operators 18 19* integrate with Gerrit/Gitiles code review/browsing system 20 21 22SEARCHING AND INDEXING 23====================== 24 25 26Positional trigrams 27------------------- 28 29We build an index of ngrams (n=3), where we store the offset of each 30ngram's occurrence within a file. For example, if the corpus is "banana" 31then we generate the index 32 33 "ban": 0 34 "ana": 1,3 35 "nan": 2 36 37If we are searching for a string (eg. "The quick brown fox"), then we 38look for two trigrams (eg. "The" and "fox"), and check that they are 39found at the right distance apart. 40 41Regular expressions are handled by extracting normal strings from the regular 42expressions. For example, to search for 43 44 (Path|PathFragment).*=.*/usr/local 45 46we look for 47 48 (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local") 49 50and any documents thus found would be searched for the regular 51expression. 52 53Compared to indexing 3-grams on a per-file basis, as described 54[here](https://swtch.com/~rsc/regexp/regexp4.html), there are some advantages: 55 56* for each substring, we only have to intersect just a couple of posting-lists: 57 one for the beginning, and one for the end. 58 59* Since we touch few posting lists per query, they can be stored on 60 slower media, such as SSD. 61 62* we can select any pair of trigrams from the pattern for which the 63 number of matches is minimal. For example, we could search for "qui" 64 rather than "the". 65 66There are some downsides compared to trigrams: 67 68* The index is large. Empirically, it is about 3x the corpus size, composed of 69 2x (offsets), and 1x (original content). However, since we have to look at 70 just a limited number of ngrams, we don't have to keep the index in memory. 71 72Compared to [suffix 73arrays](https://blog.nelhage.com/2015/02/regular-expression-search-with-suffix-arrays/), 74there are the following advantages: 75 76* The index construction is straightforward, and can easily be made 77 incremental. 78 79* Since the posting lists for a trigram can be stored on SSD, 80 searching with positional trigrams only requires 1.2x corpus size of 81 RAM. 82 83* All the matches are returned in document order. This makes it 84 straightforward to process compound boolean queries with AND and OR. 85 86Downsides compared to suffix array: 87 88* there is no way to transform regular expressions into index ranges into 89 the suffix array. 90 91 92Case sensitivity 93---------------- 94 95Code usually is searched without regard for case. In this case, when 96we are looking for "abc", we look for occurrences of all the different 97case variants, ie. {"abc", "Abc", "aBc", "ABc", ... }, and then 98compare the candidate matches without regard for case. 99 100 101UTF-8 102----- 103 104UTF-8 is the defacto encoding for unicode. Zoekt assumes that files 105are UTF-8 encoded. Characters have differing widths in UTF-8, so we 106use rune offsets in the trigram index, and convert those back to bytes 107with a lookup table: every 100 runes, we store the rune-index to 108byte-index mapping. For corpuses that are completely ASCII (fairly 109normal for source code), we short-circuit this lookup. 110 111 112Branches 113-------- 114 115Each file blob in the index has a bitmask, representing the branches 116in which the content is found, eg: 117 118 branches: [master=1, staging=2, stable=4] 119 file "x.java", branch mask=3 120 file "x.java", branch mask=4 121 122in this case, the index holds two versions of "x.java", the one 123present in "master" and "staging", and the one in the "stable" branch. 124 125With this technique, we can index many similar branches of a 126repository with little space overhead. 127 128 129Index format 130------------ 131 132The index is organized in shards, where each shard is a file, laid out 133such that it can be mmap'd efficiently. 134 135Each shard contains data for one code repository. The basic data in an 136index shard are the following 137 138 * file contents 139 * filenames 140 * the content posting lists (varint encoded) 141 * the filename posting lists (varint encoded) 142 * branch masks 143 * metadata (repository name, index format version, etc.) 144 145In practice, the shard size is about 3.5x the corpus size, composed of 146original content, posting lists, and other metadata. 147 148The format uses uint32 for all offsets, so the total size of a shard 149should be below 4G. Given the size of the posting data, this caps 150content size per shard at 1G. 151 152Currently, within a shard, a single goroutine searches all documents, 153so the shard size determines the amount of parallelism, and large 154repositories should be split across multiple shards to achieve good 155performance. 156 157The metadata section contains a version number (which by convention is 158also part of the file name of the shard). This provides a smooth 159upgrade path across format versions: generate shards in the new 160format, kill old search service, start new search service, delete old 161shards. 162 163 164Ranking 165------- 166 167In absense of advanced signals (e.g. pagerank on symbol references), 168ranking options are limited: the following signals could be used for 169ranking 170 171 * number of atoms matched 172 * closeness to matches for other atoms 173 * quality of match: does match boundary coincide with a word boundary? 174 * file latest update time 175 * filename lengh 176 * tokenizer ranking: does a match fall comment or string literal? 177 * symbol ranking: it the match a symbol definition? 178 179For the latter, it is necessary to find symbol definitions and other 180sections within files on indexing. Several (imperfect) programs to do 181this already exist, eg. `ctags`. 182 183Zoekt also supports an alternative BM25-based scoring algorithm that can be 184enabled with `UseBM25Scoring`. When enabled, each match in a file is treated 185as a term, and an approximation to BM25 is computed. This is useful for 186multi-term queries, better handling of term frequency, and appropriate 187document length normalization. 188 189 190Query language 191-------------- 192 193Queries are stored as expression trees, using the following data 194structure: 195 196 Query: 197 Atom 198 | AND QueryList 199 | OR QueryList 200 | NOT Query 201 ; 202 203 Atom: 204 ConstQuery 205 | SubStringQuery 206 | RegexpQuery 207 | RepoQuery 208 | BranchQuery 209 ; 210 211Both SubStringQuery and RegexpQuery can apply to either file or 212contents, and can optionally be case-insensitive. 213 214ConstQuery (match everything, or match nothing) is a useful construct 215for partial evaluation of a query: for each index shard through which 216we search, we partially evaluate the query, eg. when the query is 217 218 and[substr:"needle" repo:"zoekt"] 219 220then we can rewrite the query to FALSE if we are looking at a shard 221for repository "bazel", skipping the entire shard. 222 223Each query must have at least one positive atom. Negations can only 224serve to prune results generated by positive atoms. 225 226 227Query parsing 228------------- 229 230Strings in the input language are considered regular expressions 231but literal regular expressions are simplified to Substring queries, 232 233 a.*b => regexp:"a.*b" 234 a\.b => substring:"a.b" 235 236leading modifiers select different types of atoms, eg. 237 238 file:java => Substring_file:"java" 239 branch:master => Repo:"master" 240 241parentheses inside a string (possibly with escaped spaces) are 242interpreted as regular expressions, otherwise they are used for grouping 243 244 (abc def) => and[substring:"abc" substring:"def"] 245 (abc\ def) => regexp:"(abc def)" 246 247there is an implicit "AND" between elements of a parenthesized list. 248There is an "OR" operator, which has lower priority than the implicit 249"AND": 250 251 ppp qqq or rrr sss => or[and[substring:"ppp" substring:"qqq"] and[substring:"rrr" substring:"sss"]] 252 253 254GERRIT/GITILES INTEGRATION 255========================== 256 257Gerrit is a popular system for code review on open source 258projects. Its sister project Gitiles provides a browsing experience. 259 260Any code search integration with Gerrit should be made available in 261Gitiles. Gerrit/Gitiles has a complex ACL system, so a codesearch 262solution for Gerrit/Gitiles should respect these ACLs. 263 264Since Gitiles knows the identity of the logged-in user, it can 265construct search queries that respect ACLs, and even filter results 266afterwards if necessary. In such a setup, only Gitiles is allowed to 267talk to the search service, so it should be protected from general 268access, e.g. by requiring authentication. 269 270A codesearch implementation for Gitiles would change Gitiles to show a 271search box on pages relating to a repository. When searching, Gitiles 272would also render the search results. The process is as follows: 273 274 * On receiving a query, Gitiles finds the list of branches visible to the user 275 * Gitiles sends the raw query, along with branches and repository to the search service 276 * The search service parses the query, and embeds it as follows 277 278 (AND original-query repo:REPO (OR "branch:visible-1" "branch:visible-2" .. )) 279 280 * The search service returns the search results, leaving it to 281 gitiles to render them. Gitiles can apply any further filtering 282 as necessary. 283 284 285SERVICE MANAGEMENT 286================== 287 288The above details how indexing and searching works. A fully fledged 289system also crawls repositories and (re)indexes them. Since the system 290is designed to run on a single machine, we provide a service 291management tool, with the following responsibilities: 292 293 * Poll git hosting sites (eg. github.com, googlesource.com), to fetch new updates 294 * Reindex any changed repositories 295 * Run the webserver; and restart if it goes down for any reason 296 * Delete old webserver logs 297 298 299Security 300-------- 301 302This section assumes that 'zoekt' is used as a public facing 303webserver, indexing publicly available data, serving on HTTPS without 304authentication. 305 306Since the UI is unauthenticated, there are no authentication secrets to steal. 307 308Since the code is public, there is no sensitive code to steal. 309 310This leaves us with the following senstive data: 311 312 * Credentials for accesssing git repostitories (eg. github access token) 313 * TLS server certificates 314 * Query logs 315 316The system handles the following untrusted data: 317 318 * code in git repositories 319 * search queries 320 321Since 'zoekt' itself is written in Go, it does not have memory 322security problems: at worst, a bug in the query parser would lead to a 323crash. 324 325The code to index is handled by `ctags` for symbol detection. The 326security risk this poses is mitigated by using a seccomp based 327sandboxing. 328 329 330Privacy 331------- 332 333Webserver logs can contain privacy sensitive data (such as IP 334addresses and search queries). For this reason, the service management 335tool deletes them after a configurable period of time.