At 20:32 UTC on May 2nd, 2023, multiple alerts fired at once in Honeycomb, indicating a problem with ingestion on api.honeycomb.io, which is handled primarily by a service we call Shepherd. The service remained degraded until 21:06 UTC, with brief periods of only partial degradation during the incident.
This report will recount what was happening during that period (an initial degradation, immediate re-implementation of a caching mechanism, circuit breaking the service, and return to stability).
The alerts that fired at the beginning of the incident were:
Response time on the events and batch endpoints on api.honeycomb.io
Shepherd ALB timeout/error for gRPC and for HTTPS
Kubernetes Pods in CrashLoopBackOff
Simple HTTP POST endpoint check on api.honeycomb.io
This flurry of alerts clearly indicated a problem with api.honeycomb.io and the Shepherd service, which is the first responder to that endpoint. The on-call engineer immediately spun up an incident channel with the Jeli bot, as there was no ambiguity about the need to respond, and Jeli took care of broadcasting the incident company-wide and assigning an incident commander.
A heatmap of memory usage for the service showed a clear degradation around the time the alerts fired: average usage was relatively steady until an inflection point where it began climbing linearly, and after ~10 minutes the steady memory profile fell apart as Kubernetes pods crashed. An engineer quickly noticed that around 16:18 (20:18 UTC), a feature flag that enabled a new schema cache pattern had been turned off, because it had been identified as a possible culprit in a separate operational issue involving a surge in database writes. With the flag off, the old cache began filling up memory while the new cache's memory had not yet been released, producing the linear increase in memory usage.
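As an illustration of that failure mode (this is not Shepherd's actual code; every name below is hypothetical), here is a minimal Go sketch of a flag-gated cache swap. Flipping the flag under load leaves both caches resident at once, roughly doubling memory until the unused one is released:

```go
package main

import (
	"fmt"
	"sync"
)

// schemaCache is a deliberately simplified cache keyed by dataset name.
type schemaCache struct {
	mu    sync.RWMutex
	items map[string][]byte
}

func newSchemaCache() *schemaCache {
	return &schemaCache{items: make(map[string][]byte)}
}

func (c *schemaCache) get(key string, load func(string) []byte) []byte {
	c.mu.RLock()
	v, ok := c.items[key]
	c.mu.RUnlock()
	if ok {
		return v
	}
	v = load(key) // e.g. a database read on cache miss
	c.mu.Lock()
	c.items[key] = v
	c.mu.Unlock()
	return v
}

func (c *schemaCache) size() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.items)
}

// shepherd holds both cache implementations while the migration is in flight.
// Whichever one the feature flag selects gets populated on each request; the
// other keeps whatever it already holds, so flipping the flag under load
// leaves both caches resident and memory usage roughly doubles.
type shepherd struct {
	useNewCache bool // feature flag value
	oldCache    *schemaCache
	newCache    *schemaCache
}

func (s *shepherd) schemaFor(dataset string, load func(string) []byte) []byte {
	if s.useNewCache {
		return s.newCache.get(dataset, load)
	}
	// Flag turned off: the old cache starts filling from scratch while the
	// new cache's entries have not been released.
	return s.oldCache.get(dataset, load)
}

func main() {
	load := func(string) []byte { return make([]byte, 1<<10) } // fake schema blob
	s := &shepherd{useNewCache: true, oldCache: newSchemaCache(), newCache: newSchemaCache()}

	for i := 0; i < 1000; i++ { // warm the new cache under normal traffic
		s.schemaFor(fmt.Sprintf("dataset-%d", i), load)
	}
	s.useNewCache = false       // the 20:18 UTC flag flip
	for i := 0; i < 1000; i++ { // the same traffic now re-populates the old cache
		s.schemaFor(fmt.Sprintf("dataset-%d", i), load)
	}
	fmt.Println("entries held:", s.oldCache.size()+s.newCache.size()) // ~2x one cache
}
```

In this shape, turning the flag off frees nothing in the new cache; it only starts re-populating the old cache from scratch, which matches the linear memory growth seen on the heatmap.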
An engineer turned the flag back on within 2 minutes of the incident being declared, but this didn’t solve the main issue: all of the Shepherd pods were in CrashLoopBackOff, and whenever a new pod came up, it received nearly all of the traffic the service normally handles and crashed.
Around the time that was identified, a communications lead was assigned and created the public Statuspage incident, and we started hearing reports of customers having trouble sending data to the Honeycomb API.
Having at least given the service a fix for the crashes, we needed to get it into a state where it was ready to accept traffic again. We took three key measures to get there.
The decision to completely circuit break the service wasn’t taken lightly, but given that a significant portion of traffic was being rejected anyway, giving the cluster a chance to recover was likely the quickest way out. Traffic was blocked entirely for ~8 minutes, and the cluster became operational about 6 minutes after traffic was re-enabled.
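This report doesn’t spell out the mechanism used to block traffic, but as a rough sketch of the idea, one way to circuit break a service fronted by an AWS ALB is to temporarily swap the listener’s default action for a fixed 503 response, then restore forwarding once the cluster is healthy. The sketch below uses the aws-sdk-go-v2 ELBv2 client; the ARNs are placeholders, and none of this is taken from the actual runbook:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
	"github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2/types"
)

// blockTraffic swaps the ALB listener's default action to a static 503 so no
// requests reach the backing pods while the cluster recovers.
func blockTraffic(ctx context.Context, client *elbv2.Client, listenerArn string) error {
	_, err := client.ModifyListener(ctx, &elbv2.ModifyListenerInput{
		ListenerArn: aws.String(listenerArn),
		DefaultActions: []types.Action{{
			Type: types.ActionTypeEnumFixedResponse,
			FixedResponseConfig: &types.FixedResponseActionConfig{
				StatusCode:  aws.String("503"),
				ContentType: aws.String("text/plain"),
				MessageBody: aws.String("ingest temporarily paused"),
			},
		}},
	})
	return err
}

// restoreTraffic points the default action back at the service's target group
// once the pods are healthy again.
func restoreTraffic(ctx context.Context, client *elbv2.Client, listenerArn, targetGroupArn string) error {
	_, err := client.ModifyListener(ctx, &elbv2.ModifyListenerInput{
		ListenerArn: aws.String(listenerArn),
		DefaultActions: []types.Action{{
			Type:           types.ActionTypeEnumForward,
			TargetGroupArn: aws.String(targetGroupArn),
		}},
	})
	return err
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := elbv2.NewFromConfig(cfg)
	// Placeholder ARN; substitute the real listener (and later the real
	// target group for restoreTraffic) before running.
	if err := blockTraffic(ctx, client, "arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/app/EXAMPLE"); err != nil {
		log.Fatal(err)
	}
	// Once the deployment reports healthy, call restoreTraffic to re-enable ingest.
}
```

However the block was implemented in practice, the effect is the same: the cluster gets a window with no inbound load, so every pod can start, warm its cache, and pass health checks before traffic returns.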
All in all, this amounted to around 30 minutes of outage, including 5 minutes of sporadic flaps back to partial service and then 6 minutes of partial service until full recovery.
This incident had many interesting factors - for example, the aforementioned schema caching pattern is extremely new but has already become an indispensable part of the Shepherd service. Turning it on while it was in development on 4/25 caused a 2-minute ingestion outage, but after it was released to production on 4/27 (with more acclimation to allow the cache to warm up), turning it off on 5/2 caused a 30-minute one. Suddenly switching to a different cache implementation while under load effectively doubled the memory demands. This, among other aspects, will be discussed in the full report once the post-incident investigation has been completed.