fork of https://github.com/sourcegraph/zoekt
# Frequently asked questions

## Why codesearch?

Software engineering is more about reading than writing code, and part
of this process is finding the code that you should read. If you are
working on a large project, then finding source code through
navigation quickly becomes inefficient.

Search engines let you find interesting code much faster than browsing
code, in much the same way that search engines speed up finding things
on the internet.

## Can you give an example?

I had to implement SSH hashed hostkey checking on a whim recently, and
here is how I quickly zoomed into the relevant code using
[our public zoekt instance](http://cs.bazel.build):

* [hash host ssh](http://cs.bazel.build/search?q=hash+host+ssh&num=50): more than 20k results in 750 files, in 3 seconds

* [hash host r:openssh](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh&num=50): 6k results in 114 files, in 20ms

* [hash host r:openssh known_host](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh+known_host&num=50): 4k results in 42 files, in 13ms

The last query still yielded a substantial number of results, but the
function `hash_host` that I was looking for was the 3rd result from
the first file.

## What features make a code search engine great?

Often, you don't know exactly what you are looking for until you have
found it. Code search is effective because you can formulate an
approximate query, and then refine it based on the results you get. For
this to work, you need the following features:

* Coverage: the code that interests you should be available for searching

* Speed: search should return useful results quickly (sub-second), so
  you can iterate on queries

* Approximate queries: matching should be done case-insensitively, on
  arbitrary substrings, so you don't have to know what you are looking
  for in advance.

* Filtering: you can winnow down results by composing more specific queries

* Ranking: interesting results (e.g. function definitions, whole-word
  matches) should be at the top.

## How does `zoekt` provide for these?

* Coverage: `zoekt` comes with tools to mirror parts of common Git
  hosting sites. `cs.bazel.build` uses this to index most of the
  Google-authored open source software on github.com and
  googlesource.com.

* Speed: `zoekt` uses an index based on positional trigrams. For rare
  strings, e.g. `nienhuys`, this typically yields results in ~10ms if
  the operating system caches are warm.

* Approximate queries: `zoekt` supports substring patterns and regular
  expressions, and can do case-insensitive matching on UTF-8 text.

* Filtering: you can filter a query by adding extra atoms (e.g. `f:\.go$`
  limits to Go source code), and filter out terms with `-`, so
  `\blinus\b -torvalds` finds the Linuses other than Linus Torvalds.

* Ranking: `zoekt` uses
  [ctags](https://github.com/universal-ctags/ctags) to find
  declarations, and these are boosted in the search ranking.

## How does this compare to `grep -r`?

Grep lets you find arbitrary substrings, but it doesn't scale to large
corpuses, and lacks filtering and ranking.

## What about my IDE?

If your project fits into your IDE, then that is great.
Unfortunately, loading projects into IDEs is slow, cumbersome, and not
supported by all projects.

## What about the search on `github.com`?

GitHub's search has great coverage, but unfortunately, its search
functionality doesn't support arbitrary substrings. For example, a
query [for part of my
surname](https://github.com/search?utf8=%E2%9C%93&q=nienhuy&type=Code)
does not turn up anything (except this document), while
[my complete
name](https://github.com/search?utf8=%E2%9C%93&q=nienhuys&type=Code)
does.

## What about Etsy/Hound?

[Etsy/hound](https://github.com/etsy/hound) is a code search engine
which supports regular expressions over large corpuses, but it is about
10x slower than zoekt. There is only rudimentary support for
filtering, and there is no symbol ranking.

## What about livegrep?

[livegrep](https://livegrep.com) is a code search engine which
supports regular expressions over large corpuses. However, due to its
indexing technique, it requires a lot of RAM and CPU. There is only
rudimentary support for filtering, and there is no symbol ranking.

## How many resources does `zoekt` require?

The search server should have a local SSD to store the index file (which
is 3.5x the corpus size), and at least 20% more RAM than the
corpus size. For optimal performance with large codebases, consider
using machines with ample CPU cores, as search operations can be
parallelized across shards.

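As a rough sizing aid, the two ratios above (index about 3.5x the corpus, RAM at least 20% above the corpus) can be turned into a small calculation. The function name `estimate` and the 10 GiB corpus are made up for illustration; this is not part of `zoekt`.

```go
package main

import "fmt"

// estimate applies the rules of thumb from this FAQ to a corpus size in MiB:
// the index takes about 3.5x the corpus, and the server should have at
// least 20% more RAM than the corpus. Integer arithmetic keeps it exact.
func estimate(corpusMiB int64) (indexMiB, minRAMMiB int64) {
	indexMiB = corpusMiB * 35 / 10
	minRAMMiB = corpusMiB * 12 / 10
	return indexMiB, minRAMMiB
}

func main() {
	corpus := int64(10 * 1024) // hypothetical 10 GiB corpus
	index, ram := estimate(corpus)
	fmt.Printf("corpus: %d MiB, index: ~%d MiB, RAM: >= %d MiB\n", corpus, index, ram)
	// prints: corpus: 10240 MiB, index: ~35840 MiB, RAM: >= 12288 MiB
}
```
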
## Can I index multiple branches?

Yes. You can index up to 64 branches (see also
https://github.com/google/zoekt/issues/32). Files that are identical
across branches take up space just once in the index.

## How fast is the search?

Rare strings are extremely fast to retrieve. For example, `r:torvalds crazy`
(search for "crazy" in the Linux kernel) typically takes [about
7-10ms on
cs.bazel.build](http://cs.bazel.build/search?q=r%3Atorvalds+crazy&num=70).

The speed for common strings is dominated by how many results you want
to see. For example, `r:torvalds license` can give some results
quickly, but producing [all 86k
results](http://cs.bazel.build/search?q=r%3Atorvalds+license&num=50000)
takes between 100ms and 1 second. Streaming the results to your
browser and rendering the HTML then takes several seconds.

## How fast is the indexer?

The Linux kernel (55K files, 545M data) takes about 160s to index on
my x250 laptop using a single thread. The process can be parallelized
for speedup.

## What does [cs.bazel.build](https://cs.bazel.build/) run on?

Currently, it runs on a single Google Cloud VM with 16 vCPUs, 60G RAM and an
attached physical SSD.

## How does `zoekt` work?

In short, it splits each file into trigrams (groups of 3 Unicode
characters), and stores the offset of each occurrence. Substrings are
found by searching for different trigrams from the query at the correct
distance apart.

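The idea can be sketched in a few lines of Go. This is a toy, single-document, case-sensitive illustration and not the real implementation: the `index` type and its methods are hypothetical, and the choice of the first and last trigram of the query is a simplification made here (any two trigrams at a known distance would do).

```go
package main

import "fmt"

// trigram is three consecutive Unicode code points.
type trigram [3]rune

// index maps each trigram to the rune offsets at which it starts.
type index struct {
	text     []rune
	postings map[trigram][]int
}

// build records the offset of every trigram occurrence in s.
func build(s string) *index {
	rs := []rune(s)
	idx := &index{text: rs, postings: map[trigram][]int{}}
	for i := 0; i+3 <= len(rs); i++ {
		t := trigram{rs[i], rs[i+1], rs[i+2]}
		idx.postings[t] = append(idx.postings[t], i)
	}
	return idx
}

// find returns the rune offsets of every occurrence of query, which must
// be at least 3 runes long. It looks for the first and last trigram of
// the query at the right distance apart, then verifies each candidate.
func (idx *index) find(query string) []int {
	q := []rune(query)
	if len(q) < 3 {
		return nil // shorter queries need another strategy, omitted here
	}
	first := trigram{q[0], q[1], q[2]}
	last := trigram{q[len(q)-3], q[len(q)-2], q[len(q)-1]}
	dist := len(q) - 3

	lastAt := map[int]bool{}
	for _, p := range idx.postings[last] {
		lastAt[p] = true
	}
	var out []int
	for _, p := range idx.postings[first] {
		if !lastAt[p+dist] {
			continue // the two trigrams are not the right distance apart
		}
		// Both trigrams line up; compare the full candidate text.
		if string(idx.text[p:p+len(q)]) == query {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	idx := build("func hash_host(host string) string")
	fmt.Println(idx.find("host")) // prints [10 15]
}
```

Because only the posting lists of two trigrams are consulted, a rare trigram in the query keeps the candidate set small, which matches the observation above that rare strings are fast to find.
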
## I want to know more

Some further background documentation:

 * [Designdoc](design.md) for technical details
 * [Godoc](https://godoc.org/github.com/google/zoekt)
 * Gerrit 2016 user summit: [slides](https://storage.googleapis.com/gerrit-talks/summit/2016/zoekt.pdf)
 * Gerrit 2017 user summit: [transcript](https://gitenterprise.me/2017/11/01/gerrit-user-summit-zoekt-code-search-engine/), [slides](https://storage.googleapis.com/gerrit-talks/summit/2017/Zoekt%20-%20improved%20codesearch.pdf), [video](https://www.youtube.com/watch?v=_-KTAvgJYdI)