doc/faq.md at ed735fa5e9b8cc822e026f996551b8dfb20dfbad · boltless.me/zoekt

fork of https://github.com/sourcegraph/zoekt
Keegan Carruthers-Smith docs: some updates (#952) 1y ago
080acfdf
# Frequently asked questions

## Why codesearch?

Software engineering is more about reading than writing code, and part
of this process is finding the code that you should read. If you are
working on a large project, then finding source code through
navigation quickly becomes inefficient.

Search engines let you find interesting code much faster than browsing
code, in much the same way that search engines speed up finding things
on the internet.

## Can you give an example?

I had to implement SSH hashed hostkey checking on a whim recently, and
here is how I quickly zoomed in on the relevant code using
[our public zoekt instance](http://cs.bazel.build):

* [hash host ssh](http://cs.bazel.build/search?q=hash+host+ssh&num=50): more than 20k results in 750 files, in 3 seconds

* [hash host r:openssh](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh&num=50): 6k results in 114 files, in 20ms

* [hash host r:openssh known_host](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh+known_host&num=50): 4k results in 42 files, in 13ms

The last query still yielded a substantial number of results, but the
function `hash_host` that I was looking for was the 3rd result from
the first file.

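The three links above differ only in their `q` parameter, so refining a
search amounts to appending atoms to the query string. A minimal sketch
of building such URLs in Go follows. The helper name `searchURL` is
hypothetical; the `/search` path and the `q` and `num` parameters are
taken from the links above.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchURL builds a search link for cs.bazel.build.
// (Hypothetical helper; the path and parameter names mirror the links above.)
func searchURL(query string, num int) string {
	v := url.Values{}
	v.Set("q", query)             // the query, e.g. "hash host r:openssh"
	v.Set("num", fmt.Sprint(num)) // maximum number of results to show
	u := url.URL{
		Scheme:   "http",
		Host:     "cs.bazel.build",
		Path:     "/search",
		RawQuery: v.Encode(), // escapes spaces as "+" and ":" as "%3A"
	}
	return u.String()
}

func main() {
	// Each step appends one more atom to the previous query.
	for _, q := range []string{
		"hash host ssh",
		"hash host r:openssh",
		"hash host r:openssh known_host",
	} {
		fmt.Println(searchURL(q, 50))
	}
}
```

`url.Values.Encode` sorts parameters by key, so `num` comes before `q`
in the output. The resulting URLs are otherwise equivalent to the links
above.
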
## What features make a code search engine great?

Often, you don't know exactly what you are looking for until you
have found it. Code search is effective because you can formulate an
approximate query, and then refine it based on the results you get. For
this to work, you need the following features:

* Coverage: the code that interests you should be available for searching

* Speed: search should return useful results quickly (sub-second), so
  you can iterate on queries

* Approximate queries: matching should be done case insensitively, on
  arbitrary substrings, so we don't have to know what we are looking
  for in advance.

* Filtering: we can winnow down results by composing more specific queries

* Ranking: interesting results (e.g. function definitions, whole word
  matches) should be at the top.

## How does `zoekt` provide for these?

* Coverage: `zoekt` comes with tools to mirror parts of common Git
  hosting sites. `cs.bazel.build` uses this to index most of the
  Google authored open source software on github.com and
  googlesource.com.

* Speed: `zoekt` uses an index based on positional trigrams. For rare
  strings, e.g. `nienhuys`, this typically yields results in ~10ms if
  the operating system caches are warm.

* Approximate queries: `zoekt` supports substring patterns and regular
  expressions, and can do case-insensitive matching on UTF-8 text.

* Filtering: you can filter a query by adding extra atoms (e.g. `f:\.go$`
  limits to Go source code), and filter out terms with `-`, so
  `\blinus\b -torvalds` finds the Linuses other than Linus Torvalds.

* Ranking: zoekt uses
  [ctags](https://github.com/universal-ctags/ctags) to find
  declarations, and these are boosted in the search ranking.

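To make the filtering atoms above concrete, here is a rough sketch of
what `f:\.go$ \blinus\b -torvalds` selects, applied to a small
in-memory list of files with Go's `regexp` package. It only illustrates
the meaning of the patterns and is not how zoekt parses or evaluates
queries. The `file` type, the `matches` function, and the sample data
are made up for illustration, and the exclusion is assumed to apply to
the whole file.

```go
package main

import (
	"fmt"
	"regexp"
)

// file is a hypothetical stand-in for an indexed document.
type file struct {
	name, content string
}

var (
	nameFilter = regexp.MustCompile(`\.go$`)         // f:\.go$  (file name ends in .go)
	wanted     = regexp.MustCompile(`(?i)\blinus\b`) // \blinus\b, case-insensitive whole word
	excluded   = regexp.MustCompile(`(?i)torvalds`)  // -torvalds
)

// matches reports whether f passes all three atoms.
func matches(f file) bool {
	return nameFilter.MatchString(f.name) &&
		wanted.MatchString(f.content) &&
		!excluded.MatchString(f.content)
}

func main() {
	files := []file{
		{"authors.go", "// written by Linus Pauling"},
		{"kernel.go", "// written by Linus Torvalds"},
		{"authors.txt", "Linus Pauling"},
		{"plural.go", "// many linuses"},
	}
	for _, f := range files {
		if matches(f) {
			fmt.Println(f.name)
		}
	}
}
```

Only `authors.go` passes: `kernel.go` is excluded by `-torvalds`,
`authors.txt` fails the file name filter, and `linuses` is not a whole
word match for `\blinus\b`.
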
## How does this compare to `grep -r`?

Grep lets you find arbitrary substrings, but it doesn't scale to large
corpuses, and lacks filtering and ranking.

## What about my IDE?

If your project fits into your IDE, then that is great.
Unfortunately, loading projects into IDEs is slow, cumbersome, and not
supported by all projects.

## What about the search on `github.com`?

GitHub's search has great coverage, but unfortunately, its search
functionality doesn't support arbitrary substrings. For example, a
query [for part of my
surname](https://github.com/search?utf8=%E2%9C%93&q=nienhuy&type=Code)
does not turn up anything (except this document), while
[my complete
name](https://github.com/search?utf8=%E2%9C%93&q=nienhuys&type=Code)
does.

## What about Etsy/Hound?

[Etsy/hound](https://github.com/etsy/hound) is a code search engine
which supports regular expressions over large corpuses, but it is about
10x slower than zoekt. There is only rudimentary support for
filtering, and there is no symbol ranking.

## What about livegrep?

[livegrep](https://livegrep.com) is a code search engine which
supports regular expressions over large corpuses. However, due to its
indexing technique, it requires a lot of RAM and CPU. There is only
rudimentary support for filtering, and there is no symbol ranking.

## How much resources does `zoekt` require?

The search server should have local SSD to store the index file (which
is 3.5x the corpus size), and have at least 20% more RAM than the
corpus size. For optimal performance with large codebases, consider
using machines with ample CPU cores, as search operations can be
parallelized across shards.

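As a worked example of the rules of thumb above, the sketch below
estimates disk and RAM needs from the corpus size. The function name
`estimate` and the 10 GB corpus are illustrative assumptions; the two
factors come from the paragraph above.

```go
package main

import "fmt"

// Rules of thumb from the text above.
const (
	indexFactor = 3.5 // the index is about 3.5x the corpus size
	ramFactor   = 1.2 // RAM should be at least 20% more than the corpus size
)

// estimate returns the approximate index size and the minimum RAM,
// in the same unit as the corpus size. (Hypothetical helper.)
func estimate(corpusGB float64) (indexGB, ramGB float64) {
	return corpusGB * indexFactor, corpusGB * ramFactor
}

func main() {
	indexGB, ramGB := estimate(10) // a 10 GB corpus, chosen for illustration
	fmt.Printf("10 GB corpus: ~%.0f GB index on SSD, at least %.0f GB RAM\n", indexGB, ramGB)
}
```

For a 10 GB corpus this gives roughly 35 GB of SSD space for the index
and at least 12 GB of RAM.
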
## Can I index multiple branches?

Yes. You can index 64 branches (see also
https://github.com/google/zoekt/issues/32). Files that are identical
across branches take up space just once in the index.

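One way to picture both the limit of 64 and the deduplication is a
64-bit mask stored with each file, with one bit per branch. The sketch
below assumes that representation purely for illustration; the names
`branchSet`, `with`, and `has` are hypothetical and not taken from
zoekt's code.

```go
package main

import "fmt"

// branchSet records which branches contain a given file, one bit per
// branch. A 64-bit mask can represent at most 64 branches.
// (Assumed representation, for illustration only.)
type branchSet uint64

// with returns the set with the given branch number (0..63) added.
func (s branchSet) with(branch uint) branchSet {
	return s | branchSet(1)<<branch
}

// has reports whether the given branch number (0..63) is in the set.
func (s branchSet) has(branch uint) bool {
	return s&(branchSet(1)<<branch) != 0
}

func main() {
	// A file that is identical on branches 0 and 2 is stored once,
	// with both bits set in its mask.
	var mask branchSet
	mask = mask.with(0).with(2)
	fmt.Printf("mask=%03b onBranch1=%v onBranch2=%v\n", mask, mask.has(1), mask.has(2))
}
```

Under this assumption, a file shared by several branches costs one copy
of its content plus a few extra bits, which matches the behavior
described above.
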
## How fast is the search?

Rare strings are extremely fast to retrieve. For example,
`r:torvalds crazy` (search "crazy" in the linux kernel) typically takes [about
7-10ms on
cs.bazel.build](http://cs.bazel.build/search?q=r%3Atorvalds+crazy&num=70).

The speed for common strings is dominated by how many results you want
to see. For example, `r:torvalds license` can give some results
quickly, but producing [all 86k
results](http://cs.bazel.build/search?q=r%3Atorvalds+license&num=50000)
takes between 100ms and 1 second. Streaming the results to your
browser and rendering the HTML then takes several seconds.

## How fast is the indexer?

The Linux kernel (55K files, 545M data) takes about 160s to index on
my x250 laptop using a single thread. The process can be parallelized
for speedup.

## What does [cs.bazel.build](https://cs.bazel.build/) run on?

Currently, it runs on a single Google Cloud VM with 16 vCPUs, 60G RAM and an
attached physical SSD.

## How does `zoekt` work?

In short, it splits each file into trigrams (groups of 3 Unicode
characters), and stores the offset of each occurrence. Substrings are
found by searching for different trigrams from the query at the correct
distance apart.

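The idea can be sketched in a few lines of Go. This is a simplified
in-memory illustration over a single string, not zoekt's actual index
format or search code. The names `trigram`, `index`, `buildIndex`, and
`find` are hypothetical, and picking the first and last trigram of the
query is an assumption made to keep the example short.

```go
package main

import "fmt"

// trigram is a group of 3 Unicode characters.
type trigram [3]rune

// index maps each trigram to the character offsets where it occurs.
type index map[trigram][]int

// buildIndex records the offset of every trigram in text.
func buildIndex(text string) index {
	runes := []rune(text)
	idx := make(index)
	for i := 0; i+3 <= len(runes); i++ {
		t := trigram{runes[i], runes[i+1], runes[i+2]}
		idx[t] = append(idx[t], i)
	}
	return idx
}

// find returns the character offsets at which query occurs in text.
// It looks up the first and last trigram of the query, keeps the
// positions where they lie the correct distance apart, and then
// verifies each candidate against the text.
func find(text string, idx index, query string) []int {
	q := []rune(query)
	if len(q) < 3 {
		return nil // this sketch only handles queries of 3+ characters
	}
	runes := []rune(text)
	first := trigram{q[0], q[1], q[2]}
	last := trigram{q[len(q)-3], q[len(q)-2], q[len(q)-1]}
	dist := len(q) - 3 // expected distance between the two trigrams

	lastAt := make(map[int]bool)
	for _, p := range idx[last] {
		lastAt[p] = true
	}
	var hits []int
	for _, p := range idx[first] {
		if !lastAt[p+dist] {
			continue // trigrams are not the right distance apart
		}
		// The trigram check only yields candidates; confirm the full match.
		if string(runes[p:p+len(q)]) == query {
			hits = append(hits, p)
		}
	}
	return hits
}

func main() {
	text := "func hash_host(host string) string { return hash(host) }"
	idx := buildIndex(text)
	fmt.Println(find(text, idx, "hash_host")) // prints [5]
}
```

In this example, `hash(host` also has `has` and `ost` at the right
distance apart, so the final comparison against the text is what rejects
it. Rare trigrams have short offset lists, which is one reason rare
strings are quick to look up.
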
## I want to know more

Some further background documentation:

 * [Designdoc](design.md) for technical details
 * [Godoc](https://godoc.org/github.com/google/zoekt)
 * Gerrit 2016 user summit: [slides](https://storage.googleapis.com/gerrit-talks/summit/2016/zoekt.pdf)
 * Gerrit 2017 user summit: [transcript](https://gitenterprise.me/2017/11/01/gerrit-user-summit-zoekt-code-search-engine/), [slides](https://storage.googleapis.com/gerrit-talks/summit/2017/Zoekt%20-%20improved%20codesearch.pdf), [video](https://www.youtube.com/watch?v=_-KTAvgJYdI)