Commits

This adds a monitor that reports the progress of symbol analysis every
minute. Additionally, if a document takes too long to analyse (more
than 10s), we report it.

For now this only reports via stdlog. However, once we are comfortable
with the thresholds, we can likely also add a way to kill analysis for
a file.

Test Plan: Adjusted monitorReportStatus to 1s, then indexed the
sourcegraph repo and inspected the output:

$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:03:10 attempting to index 14533 total files
2023/11/03 16:03:13 DEBUG: symbol analysis still running for shard statistics: duration=1s symbols=15805 bytes=44288971
2023/11/03 16:03:14 DEBUG: symbol analysis still running for shard statistics: duration=2s symbols=26189 bytes=51564417
2023/11/03 16:03:15 DEBUG: symbol analysis still running for shard statistics: duration=3s symbols=55613 bytes=64748084
2023/11/03 16:03:16 DEBUG: symbol analysis still running for shard statistics: duration=4s symbols=86557 bytes=93771404
2023/11/03 16:03:17 DEBUG: symbol analysis still running for shard statistics: duration=5s symbols=125352 bytes=116319453
2023/11/03 16:03:18 symbol analysis finished for shard statistics: duration=5s symbols=142951 bytes=129180023
2023/11/03 16:03:22 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983298 index bytes (overhead 2.8), 14533 files processed

I then added an arbitrary one-minute sleep for one file to see the
stuck-document reporting:

$ go run ./cmd/zoekt-git-index -require_ctags ../sourcegraph/
2023/11/03 16:14:57 attempting to index 14533 total files
2023/11/03 16:15:15 WARN: symbol analysis for README.md (3485 bytes) has been running for 14s
2023/11/03 16:15:25 WARN: symbol analysis for README.md (3485 bytes) has been running for 24s
2023/11/03 16:15:45 WARN: symbol analysis for README.md (3485 bytes) has been running for 44s
2023/11/03 16:16:00 DEBUG: symbol analysis still running for shard statistics: duration=1m0s symbols=958 bytes=624329
2023/11/03 16:16:00 symbol analysis for README.md (size 3485 bytes) is done and found 4 symbols
2023/11/03 16:16:06 symbol analysis finished for shard statistics: duration=1m5s symbols=142951 bytes=129180023
2023/11/03 16:16:10 finished shard github.com%2Fsourcegraph%2Fsourcegraph_v16.00000.zoekt: 283983299 index bytes (overhead 2.8), 14533 files processed

2y ago

Julie Tibshirani

c7e066e9

Scoring: test against local scip-ctags (#677)

2y ago

Julie Tibshirani

503302fe

Debug: make indexing timeout configurable (#676)

2y ago

Julie Tibshirani

c23ed052

Ranking: standardize ctags kind names before scoring (#674)

2y ago

Keegan Carruthers-Smith

b5a5fdc8

score: boost exported go ident and downrank _test.go (#675)

2y ago

Keegan Carruthers-Smith

1a3dddce

score: experimental extension novelty in sorting (#665)

2y ago

Julie Tibshirani

0f21f325

C-tags: use type def instead of type alias (#672)

2y ago

Keegan Carruthers-Smith

d4ba942a

gomod: update go-ctags and ctags in docker (#673)

2y ago

Stefan Hengl

0a17ccb2

scoring: reduce allocations for addScore (#670)

2y ago

Stefan Hengl

5bbf05d4

fix tests (#671)

2y ago

Stefan Hengl

ca7ee51e

scoring: show atom count in debug score (#669)

2y ago

Stefan Hengl

c869a248

scoring: score methods and funcs the same (#666)

2y ago

Keegan Carruthers-Smith

d3fc0dce

score: remove repetition-boost (#667)

2y ago

Keegan Carruthers-Smith

7cc2872d

nix: use tag for rev in ctags derivation

2y ago

Keegan Carruthers-Smith

328fcb7a

nix: use go 1.21 and universal-ctags 6.0.0 (#664)

2y ago

Keegan Carruthers-Smith

4f214152

score: clean up debug output (#663)

2y ago

Keegan Carruthers-Smith

d8bfea1e

score: factors for headers in markdown (#661)

2y ago

Keegan Carruthers-Smith

70f5dd3f

score: always upscore symbol matches (#662)

2y ago

Keegan Carruthers-Smith

081cd037

boost graphql types in results (#659)

2y ago

Keegan Carruthers-Smith

f39e6eb5

zoekt: add debug flag to show DebugScore output (#660)

2y ago

William Bezuidenhout

16e2ff8c

logger: remove description param (#657)

2y ago

Keegan Carruthers-Smith +1

f17ff0ba

scoring: handle scip-ctags kinds (#655)

2y ago

Keegan Carruthers-Smith

659eac98

all: remove deprecated RepoList.Minimal (#624)

2y ago

Anatoli Babenia

8c5bd7de

Use Go 1.21.2 (#653)

2y ago

Keegan Carruthers-Smith

cc1b5cda

ctags: allow binary to be anything with validation (#652)

2y ago

Keegan Carruthers-Smith

1065c664

gomod: bump go-ctags

2y ago

Geoffrey Gilmore

dfc14cb6

grpc: add support for prometheus metrics that calculates message size (#651)

2y ago

Geoffrey Gilmore

2011bba5

grpc: zoekt-sourcegraph-indexserver: enable by default, support reading from SG_FEATURE_FLAG_GRPC (#650)

2y ago

Geoffrey Gilmore

089709e4

zoekt: upgrade to grpc-ecosystem/go-grpc-middleware@v2.0.0 (#648)

2y ago

Michael Lin

19f03eed

chore: upgrade sourcegraph/log (#649)

2y ago

Keegan Carruthers-Smith

adf376d3

web: informative and verbose error message when watchdog fails (#647)

2y ago

Stefan Hengl +2

af126653

indexserver: delete tmp dir on startup (#646)

2y ago

Geoffrey Gilmore

48ed5ac5

grpc: zoekt-sourcegraph-indexserver: support retries when frontend isn't available (#645)

2y ago

Geoffrey Gilmore

2d1affd4

grpc: RepoList: actually persist "repos" field when converting to protobuf message (#644)

2y ago

Geoffrey Gilmore

3ce1f2b2

grpc: add prometheus server and client prometheus metrics (#642)

2y ago

Geoffrey Gilmore

40a9a23b

grpc: FileMatch: tweak file_name to be bytes instead of string (#641)

2y ago

Geoffrey Gilmore

f75df3d8

grpc: port messagesize interceptors and raise default client message size to 90mb (#640)

2y ago

Geoffrey Gilmore

993cfdb2

grpc: port internal error interceptors from sourcegraph/sourcegraph (#639)

2y ago

Geoffrey Gilmore

fcb279ae

grpc: zoekt-webserver: stream search: break up file matches across multiple messages (#636)

2y ago

Camden Cheek

956d775e

Extract samplingSender and use it for gRPC (#637)

2y ago

Dave Try

d5723536

remove bazel (#634)

2y ago

Keegan Carruthers-Smith

63da184a

stat: introduce timing stats around shard search (#633)

2y ago

Ian Kerins

9559422b

DisplayTruncator: always apply both limits (#632)

2y ago

Keegan Carruthers-Smith

eede1229

gofmt -s -w .

2y ago

Keegan Carruthers-Smith

626c7d8f

introduce DisplayTruncator (#630)

2y ago

Ian Kerins

6a428ad6

SearchOptions: add MaxMatchDisplayCount (#615)

All clients of zoekt have a shared problem: they have no reliable way to
bound the size of the SearchResult. The primary dimension that
determines the size of a SearchResult is the number of matches. None of
the existing levers zoekt provides sufficiently limit this size:
- MaxDocDisplayCount is a hard limit on the number of Files in the
SearchResult. But because a single File can have an arbitrary number of
matches for the query, you can still end up with enormous
SearchResults even when this parameter is 1.

The existing *MaxMatchCount parameters are more about limiting the
amount of work zoekt does when executing queries than they are about
limiting the response size:
- TotalMaxMatchCount is a soft limit on the number of matches
across shards. But it is only evaluated after handling each shard, so
if a single shard has an enormous number of matches, the SearchResult
will be enormous.
- ShardMaxMatchCount is a soft limit on the number of matches from a
single shard. But it is only evaluated after handling each document, so
if a single document has an enormous number of matches, the
SearchResult will be enormous.
- ShardRepoMaxMatchCount, well, you get the idea.
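
To make the "soft limit" point concrete, here is a small sketch, assuming a
hypothetical collect function rather than zoekt's actual search loop. The
limit is only checked after a whole shard has been handled, so one large
shard overshoots it by an arbitrary amount.

```go
package main

import "fmt"

// collect gathers matches shard by shard and stops once the running total
// reaches softLimit. The limit is only checked after a whole shard has been
// appended, which is what makes it "soft".
func collect(shards [][]string, softLimit int) []string {
	var out []string
	for _, shard := range shards {
		out = append(out, shard...)
		if len(out) >= softLimit {
			break
		}
	}
	return out
}

func main() {
	big := make([]string, 100000) // one shard with an enormous number of matches
	shards := [][]string{big, {"m"}}
	// The limit is 10, but every match from the first shard is returned.
	fmt.Println(len(collect(shards, 10)))
}
```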

Clients differ in their ability to tolerate enormous SearchResults.
Sourcegraph, for example, is apparently doing just fine; they put hard
limits on the number of matches in their own server, which is itself a
client of zoekt. They're presumably able to tolerate large responses
from zoekt because it runs colocated in a datacenter environment.

But clients that are, for example, running in browsers and using the
less-compact JSON-encoded API are much less able to cope with enormous
SearchResults, which can be multiple megabytes in size even with the
most conservative settings of the existing parameters.

Enter MaxMatchDisplayCount, which has similar semantics to
MaxDocDisplayCount, and is used by zoekt in the exact same places as
that parameter. With this, clients can get a much better handle on the
size of zoekt SearchResults.
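
A rough sketch of what applying both display limits as hard limits could
look like. The file type and truncate function are hypothetical and are not
zoekt's API. Treating a limit of zero or less as "no limit" is likewise an
assumption made for this example.

```go
package main

import "fmt"

// file is a hypothetical stand-in for one file in a search result.
type file struct {
	name    string
	matches []string
}

// truncate applies both display limits as hard limits.
// A limit <= 0 means "no limit" (an assumption of this sketch).
func truncate(files []file, maxDocs, maxMatches int) []file {
	if maxDocs > 0 && len(files) > maxDocs {
		files = files[:maxDocs]
	}
	if maxMatches <= 0 {
		return files
	}
	out := make([]file, 0, len(files))
	remaining := maxMatches
	for _, f := range files {
		if remaining == 0 {
			break // match budget exhausted, drop the remaining files
		}
		if len(f.matches) > remaining {
			f.matches = f.matches[:remaining] // f is a copy, the input is untouched
		}
		remaining -= len(f.matches)
		out = append(out, f)
	}
	return out
}

func main() {
	files := []file{
		{"a.go", []string{"m1", "m2", "m3"}},
		{"b.go", []string{"m4", "m5", "m6", "m7", "m8"}},
		{"c.go", []string{"m9"}},
	}
	// With a budget of 4 matches: a.go keeps 3, b.go keeps 1, c.go is dropped.
	for _, f := range truncate(files, 10, 4) {
		fmt.Println(f.name, len(f.matches))
	}
}
```

Unlike the soft limits, the total is bounded regardless of how the matches
are distributed across files, which is the property a size-sensitive client
needs.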