fork of https://github.com/sourcegraph/zoekt

# Frequently asked questions

## Why codesearch?

Software engineering is more about reading than writing code, and part
of this process is finding the code that you should read. If you are
working on a large project, then finding source code through
navigation quickly becomes inefficient.

Search engines let you find interesting code much faster than browsing
code, in much the same way that search engines speed up finding things
on the internet.

## Can you give an example?

I had to implement SSH hashed hostkey checking on a whim recently, and
here is how I quickly zoomed into the relevant code using
[our public zoekt instance](http://cs.bazel.build):

* [hash host ssh](http://cs.bazel.build/search?q=hash+host+ssh&num=50): more than 20k results in 750 files, in 3 seconds

* [hash host r:openssh](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh&num=50): 6k results in 114 files, in 20ms

* [hash host r:openssh known_host](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh+known_host&num=50): 4k results in 42 files, in 13ms

The last query still yielded a substantial number of results, but the
function `hash_host` that I was looking for was the 3rd result from
the first file.

## What features make a code search engine great?

Often, you don't know exactly what you are looking for until you have
found it. Code search is effective because you can formulate an
approximate query, and then refine it based on the results you got. For
this to work, you need the following features:

* Coverage: the code that interests you should be available for searching

* Speed: search should return useful results quickly (sub-second), so
  you can iterate on queries

* Approximate queries: matching should be done case insensitively, on
  arbitrary substrings, so we don't have to know what we are looking
  for in advance.

* Filtering: we can winnow down results by composing more specific queries

* Ranking: interesting results (e.g. function definitions, whole word
  matches) should be at the top.

## How does `zoekt` provide for these?

* Coverage: `zoekt` comes with tools to mirror parts of common Git
  hosting sites. `cs.bazel.build` uses this to index most of the
  Google-authored open source software on github.com and
  googlesource.com.

* Speed: `zoekt` uses an index based on positional trigrams. For rare
  strings, e.g. `nienhuys`, this typically yields results in ~10ms if
  the operating system caches are warm.

* Approximate queries: `zoekt` supports substring patterns and regular
  expressions, and can do case-insensitive matching on UTF-8 text.

* Filtering: you can filter a query by adding extra atoms (e.g. `f:\.go$`
  limits to Go source code), and filter out terms with `-`, so
  `\blinus\b -torvalds` finds the Linuses other than Linus Torvalds.

* Ranking: `zoekt` uses
  [ctags](https://github.com/universal-ctags/ctags) to find
  declarations, and these are boosted in the search ranking.


## How does this compare to `grep -r`?

Grep lets you find arbitrary substrings, but it doesn't scale to large
corpuses, and lacks filtering and ranking.

## What about my IDE?

If your project fits into your IDE, then that is great.
Unfortunately, loading projects into IDEs is slow, cumbersome, and not
supported by all projects.

## What about the search on `github.com`?

GitHub's search has great coverage, but unfortunately, its search
functionality doesn't support arbitrary substrings. For example, a
query [for part of my
surname](https://github.com/search?utf8=%E2%9C%93&q=nienhuy&type=Code)
does not turn up anything (except this document), while
[my complete
name](https://github.com/search?utf8=%E2%9C%93&q=nienhuys&type=Code)
does.

## What about Etsy/Hound?

[Etsy/hound](https://github.com/etsy/hound) is a code search engine
which supports regular expressions over large corpuses, but it is about
10x slower than zoekt. In addition, there is only rudimentary support for
filtering, and there is no symbol ranking.

## What about livegrep?

[livegrep](https://livegrep.com) is a code search engine which
supports regular expressions over large corpuses. However, due to its
indexing technique, it requires a lot of RAM and CPU. There is only
rudimentary support for filtering, and there is no symbol ranking.

## How many resources does `zoekt` require?

The search server should have local SSD to store the index file (which
is 3.5x the corpus size), and have at least 20% more RAM than the
corpus size.

## Can I index multiple branches?

Yes. You can index 64 branches (see also
https://github.com/google/zoekt/issues/32). Files that are identical
across branches take up space just once in the index.

## How fast is the search?

Rare strings are extremely fast to retrieve. For example, `r:torvalds
crazy` (search "crazy" in the linux kernel) typically takes [about
7-10ms on
cs.bazel.build](http://cs.bazel.build/search?q=r%3Atorvalds+crazy&num=70).

The speed for common strings is dominated by how many results you want
to see. For example, `r:torvalds license` can give some results
quickly, but producing [all 86k
results](http://cs.bazel.build/search?q=r%3Atorvalds+license&num=50000)
takes between 100ms and 1 second. Then, streaming the results to your
browser and rendering the HTML takes several seconds.

## How fast is the indexer?

The Linux kernel (55K files, 545M data) takes about 160s to index on
my x250 laptop using a single thread. The process can be parallelized
for speedup.

## What does [cs.bazel.build](https://cs.bazel.build/) run on?

Currently, it runs on a single Google Cloud VM with 16 vCPUs, 60G RAM and an
attached physical SSD.

## How does `zoekt` work?

In short, it splits up the file in trigrams (groups of 3 Unicode
characters), and stores the offset of each occurrence. Substrings are
found by searching different trigrams from the query at the correct
distance apart.

## I want to know more

Some further background documentation:

 * [Designdoc](design.md) for technical details
 * [Godoc](https://godoc.org/github.com/google/zoekt)
 * Gerrit 2016 user summit: [slides](https://storage.googleapis.com/gerrit-talks/summit/2016/zoekt.pdf)
 * Gerrit 2017 user summit: [transcript](https://gitenterprise.me/2017/11/01/gerrit-user-summit-zoekt-code-search-engine/), [slides](https://storage.googleapis.com/gerrit-talks/summit/2017/Zoekt%20-%20improved%20codesearch.pdf), [video](https://www.youtube.com/watch?v=_-KTAvgJYdI)
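
## What does a positional trigram lookup look like in code?

To make the answer to "How does `zoekt` work?" more concrete, here is a
minimal Python sketch of a positional trigram index. It is an
illustration under simplifying assumptions, not zoekt's actual
implementation: the names `build_index` and `find` are hypothetical,
matching is case-sensitive, the corpus is a single in-memory string, and
only the first and last trigram of the query are compared before the
candidate is verified against the text.

```python
from collections import defaultdict


def build_index(text):
    """Map each trigram (3 Unicode characters) to the offsets where it occurs."""
    index = defaultdict(list)
    for offset in range(len(text) - 2):
        index[text[offset:offset + 3]].append(offset)
    return index


def find(text, index, query):
    """Return the offsets of `query` in `text` (hypothetical helper).

    Two trigrams of the query are looked up, and only offsets where they
    occur at the correct distance apart are kept as candidates.
    """
    if len(query) < 3:
        raise ValueError("query must be at least 3 characters long")
    first, last = query[:3], query[-3:]
    distance = len(query) - 3  # offset of the last trigram within the query
    last_offsets = set(index.get(last, ()))
    candidates = [o for o in index.get(first, ()) if o + distance in last_offsets]
    # Matching trigrams only make a candidate; confirm against the real text.
    return [o for o in candidates if text[o:o + len(query)] == query]


text = "func hash_host(host) { return hash(host) }"
index = build_index(text)
print(find(text, index, "hash_host"))
print(find(text, index, "host"))
```

The sketch always picks the first and last trigram for simplicity. Any
pair of trigrams from the query would work, and picking rarer ones
leaves fewer candidates to verify.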