fork of https://github.com/sourcegraph/zoekt

web: informative and verbose error message when watchdog fails (#647)

Right now we use log.Panicf, which produces a stack trace that is
misleading about what is happening and fills up the space used by
Kubernetes error reporting. Additionally, we have had a few bug reports
about the watchdog failing. This commit updates the message to be far
more informative about next steps.

Additionally we update the watchdog error to include the response body
in case that contains useful information for debugging.
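
As a self-contained illustration of the new error format, the sketch below runs the same read-body-then-check-status pattern against a test server that always fails. checkHealth and failingServer are hypothetical names introduced here, and the handler is only an assumed stand-in for the modified serveHealthz used in the test plan.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// checkHealth is a hypothetical helper mirroring the shape of the change:
// it fetches url and, on a non-200 status, includes the body in the error.
func checkHealth(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	// Read the body before closing so it can be quoted in the error.
	// The read error is deliberately ignored: a partial body is still useful.
	body, _ := io.ReadAll(resp.Body)
	_ = resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("watchdog: status=%v body=%q", resp.StatusCode, string(body))
	}
	return nil
}

// failingServer starts a test server whose handler always reports unhealthy.
// http.Error appends a trailing newline to the message it writes.
func failingServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "not ready: boom", http.StatusInternalServerError)
	}))
}

func main() {
	srv := failingServer()
	defer srv.Close()
	fmt.Println(checkHealth(srv.Client(), srv.URL))
}
```

The %q verb quotes and escapes the body, so newlines or control characters in the response stay on a single log line.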

Test Plan: Updated the serveHealthz handler to always return an error,
then ran the following:

$ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
Final error: watchdog: status=500 body="not ready: boom\n"
Possible Remediations:
- If this rarely happens, ignore and let your process manager restart zoekt.
- Possibly under provisioned. Try increasing CPU or disk IO.
- A bug. Reach out with logs and screenshots of metrics when this occurs.
exit status 3

+13 -2
cmd/zoekt-webserver/main.go
@@ -24,6 +24,7 @@
 	"flag"
 	"fmt"
 	"html/template"
+	"io"
 	"log"
 	"net"
 	"net/http"
@@ -438,9 +439,11 @@
 	if err != nil {
 		return err
 	}
+	body, _ := io.ReadAll(resp.Body)
+	_ = resp.Body.Close()
 
 	if resp.StatusCode != http.StatusOK {
-		return fmt.Errorf("watchdog: status %v", resp.StatusCode)
+		return fmt.Errorf("watchdog: status=%v body=%q", resp.StatusCode, string(body))
 	}
 	return nil
 }
@@ -462,7 +465,15 @@
 		metricWatchdogErrors.Set(float64(errCount))
 		metricWatchdogErrorsTotal.Inc()
 		if errCount >= maxErrCount {
-			log.Panicf("watchdog: %v", err)
+			log.Printf(`watchdog health check has consecutively failed %d times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
+
+Final error: %v
+
+Possible remediations:
+- If this rarely happens, ignore and let your process manager restart zoekt.
+- Possibly under provisioned. Try increasing CPU or disk IO.
+- A bug. Reach out with logs and screenshots of metrics when this occurs.`, errCount, err)
+			os.Exit(3)
 		} else {
 			log.Printf("watchdog: failed, will try %d more times: %v", maxErrCount-errCount, err)
 		}