fork of https://github.com/sourcegraph/zoekt

# Frequently asked questions

## Why codesearch?

Software engineering is more about reading than writing code, and part
of this process is finding the code that you should read. If you are
working on a large project, then finding source code through
navigation quickly becomes inefficient.

Search engines let you find interesting code much faster than browsing
code, in much the same way that search engines speed up finding things
on the internet.

## Can you give an example?

I had to implement SSH hashed hostkey checking on a whim recently, and
here is how I quickly zoomed in on the relevant code using
[our public zoekt instance](http://cs.bazel.build):

* [hash host ssh](http://cs.bazel.build/search?q=hash+host+ssh&num=50): more than 20k results in 750 files, in 3 seconds

* [hash host r:openssh](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh&num=50): 6k results in 114 files, in 20ms

* [hash host r:openssh known_host](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh+known_host&num=50): 4k results in 42 files, in 13ms

The last query still yielded a substantial number of results, but the
function `hash_host` that I was looking for was the 3rd result from
the first file.

## What features make a code search engine great?

Often, you don't know exactly what you are looking for until you
have found it. Code search is effective because you can formulate an
approximate query, and then refine it based on the results you got. For
this to work, you need the following features:

* Coverage: the code that interests you should be available for searching

* Speed: search should return useful results quickly (sub-second), so
  you can iterate on queries

* Approximate queries: matching should be done case insensitively, on
  arbitrary substrings, so we don't have to know what we are looking
  for in advance.

* Filtering: we can winnow down results by composing more specific queries

* Ranking: interesting results (e.g. function definitions, whole word
  matches) should be at the top.

## How does `zoekt` provide for these?

* Coverage: `zoekt` comes with tools to mirror parts of common Git
  hosting sites. `cs.bazel.build` uses this to index most of the
  Google-authored open source software on github.com and
  googlesource.com.

* Speed: `zoekt` uses an index based on positional trigrams. For rare
  strings, e.g. `nienhuys`, this typically yields results in ~10ms if
  the operating system caches are warm.

* Approximate queries: `zoekt` supports substring patterns and regular
  expressions, and can do case-insensitive matching on UTF-8 text.

* Filtering: you can filter a query by adding extra atoms (e.g. `f:\.go$`
  limits to Go source code), and filter out terms with `-`, so
  `\blinus\b -torvalds` finds the Linuses other than Linus Torvalds.

* Ranking: `zoekt` uses
  [ctags](https://github.com/universal-ctags/ctags) to find
  declarations, and these are boosted in the search ranking.

## How does this compare to `grep -r`?

Grep lets you find arbitrary substrings, but it doesn't scale to large
corpuses, and lacks filtering and ranking.

## What about my IDE?

If your project fits into your IDE, then that is great.
Unfortunately, loading projects into IDEs is slow, cumbersome, and not
supported by all projects.

## What about the search on `github.com`?

GitHub's search has great coverage, but unfortunately, its search
functionality doesn't support arbitrary substrings. For example, a
query [for part of my
surname](https://github.com/search?utf8=%E2%9C%93&q=nienhuy&type=Code)
does not turn up anything (except this document), while
[my complete
name](https://github.com/search?utf8=%E2%9C%93&q=nienhuys&type=Code)
does.

## What about Etsy/Hound?

[Etsy/hound](https://github.com/etsy/hound) is a code search engine
which supports regular expressions over large corpuses, but it is about
10x slower than zoekt. In addition, there is only rudimentary support for
filtering, and there is no symbol ranking.

## What about livegrep?

[livegrep](https://livegrep.com) is a code search engine which
supports regular expressions over large corpuses. However, due to its
indexing technique, it requires a lot of RAM and CPU. There is only
rudimentary support for filtering, and there is no symbol ranking.

## How many resources does `zoekt` require?

The search server should have a local SSD to store the index file (which
is 3.5x the corpus size), and have at least 20% more RAM than the
corpus size. For optimal performance with large codebases, consider
using machines with ample CPU cores, as search operations can be
parallelized across shards.

## Can I index multiple branches?

Yes. You can index up to 64 branches (see also
https://github.com/google/zoekt/issues/32). Files that are identical
across branches take up space just once in the index.

## How fast is the search?

Rare strings are extremely fast to retrieve. For example, `r:torvalds
crazy` (search "crazy" in the linux kernel) typically takes [about
7-10ms on
cs.bazel.build](http://cs.bazel.build/search?q=r%3Atorvalds+crazy&num=70).

The speed for common strings is dominated by how many results you want
to see. For example, `r:torvalds license` can give some results
quickly, but producing [all 86k
results](http://cs.bazel.build/search?q=r%3Atorvalds+license&num=50000)
takes between 100ms and 1 second. After that, streaming the results to your
browser and rendering the HTML takes several seconds.

## How fast is the indexer?

The Linux kernel (55K files, 545M data) takes about 160s to index on
my x250 laptop using a single thread.
The process can be parallelized
for speedup.

## What does [cs.bazel.build](https://cs.bazel.build/) run on?

Currently, it runs on a single Google Cloud VM with 16 vCPUs, 60G RAM and an
attached physical SSD.

## How does `zoekt` work?

In short, it splits each file into trigrams (groups of 3 Unicode
characters), and stores the offset of each occurrence. Substrings are
found by searching different trigrams from the query at the correct
distance apart.

## I want to know more

Some further background documentation:

 * [Design doc](design.md) for technical details
 * [Godoc](https://godoc.org/github.com/google/zoekt)
 * Gerrit 2016 user summit: [slides](https://storage.googleapis.com/gerrit-talks/summit/2016/zoekt.pdf)
 * Gerrit 2017 user summit: [transcript](https://gitenterprise.me/2017/11/01/gerrit-user-summit-zoekt-code-search-engine/), [slides](https://storage.googleapis.com/gerrit-talks/summit/2017/Zoekt%20-%20improved%20codesearch.pdf), [video](https://www.youtube.com/watch?v=_-KTAvgJYdI)
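
## What does a positional trigram lookup look like?

The answer to "How does `zoekt` work?" describes storing the offset of every trigram and matching query trigrams at the correct distance apart. The toy Python sketch below illustrates that idea only; it is not zoekt's implementation, which is written in Go and also handles case folding, shards and an on-disk format. The names `build_index` and `find` are hypothetical, and checking just the first and last trigram of the query before verifying against the text is an assumption made to keep the example short.

```python
from collections import defaultdict


def build_index(text):
    """Map each trigram to the list of offsets where it starts (ascending)."""
    index = defaultdict(list)
    for offset in range(len(text) - 2):
        index[text[offset:offset + 3]].append(offset)
    return index


def find(index, text, query):
    """Return all offsets of `query` in `text` using the positional index."""
    if len(query) < 3:
        raise ValueError("query must contain at least 3 characters")
    # Distance between the start of the first and the last trigram of the query.
    distance = len(query) - 3
    first = index.get(query[:3], [])
    last = set(index.get(query[-3:], []))
    # Keep positions where both trigrams occur the right distance apart.
    candidates = [p for p in first if p + distance in last]
    # The characters in between were not checked, so confirm against the text.
    return [p for p in candidates if text.startswith(query, p)]


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    index = build_index(text)
    print(find(index, text, "the"))   # offsets of both occurrences
    print(find(index, text, "lazy"))
```

Rare trigrams have short offset lists, so few candidates need to be verified; this is consistent with the observation above that rare strings are the fastest to retrieve.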