web: informative and verbose error message when watchdog fails (#647)
Right now we use panicf, which produces a stack trace that is misleading
about what is happening and fills up the space used by Kubernetes error
reporting. Additionally, we have had a few bug reports about the
watchdog failing. This commit updates the message to be far more
informative about next steps.
We also update the watchdog error to include the response body in case
it contains useful information for debugging.
Test Plan: Updated the serveHealthz handler to always return an error,
then ran the following:
$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3