At 20:32 UTC on May 2nd, 2023, multiple alerts fired at once in Honeycomb, indicating a problem with ingestion on api.honeycomb.io, which is handled primarily by a service we call Shepherd. The service remained degraded until 21:06 UTC, with brief periods of only partial degradation during the incident.
This report will recount what was happening during that period (an initial degradation, immediate re-implementation of a caching mechanism, circuit breaking the service, and return to stability).
The alerts that fired at the beginning of the incident were:
Response time on the events and batch endpoints on api.honeycomb.io
Shepherd ALB timeout/error for gRPC and for HTTPS
Kubernetes Pods in CrashLoopBackOff
Simple HTTP POST endpoint check on api.honeycomb.io
This flurry of alerts clearly indicated a problem with api.honeycomb.io and the Shepherd service, which is the first responder to that endpoint. The on-call engineer immediately spun up an incident channel with the Jeli bot, as there was no ambiguity about the need to respond, and Jeli took care of broadcasting the incident company-wide and assigning an incident commander.
A heatmap of memory usage for the service showed a clear degradation around the time the alerts fired: average usage was relatively steady until an inflection point where it began climbing linearly, and after ~10 minutes the steady memory profile fell apart as Kubernetes pods crashed. An engineer quickly noticed that around 16:18 (20:18 UTC), a feature flag that enabled a new schema cache pattern had been turned off, because it had been identified as a possible culprit in a separate operational issue involving a surge in database writes. With the flag off, the old cache began filling up memory while the new cache's memory had not yet been released, producing the linear increase in memory usage.
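As an illustration of that failure mode (this is not Shepherd's actual code; every name below is hypothetical), here is a minimal Go sketch of a flag-gated cache swap. Flipping the flag under load leaves both caches resident at once, roughly doubling memory until the unused one is released:

```go
package main

import (
	"fmt"
	"sync"
)

// schemaCache is a deliberately simplified cache keyed by dataset name.
type schemaCache struct {
	mu    sync.RWMutex
	items map[string][]byte
}

func newSchemaCache() *schemaCache {
	return &schemaCache{items: make(map[string][]byte)}
}

func (c *schemaCache) get(key string, load func(string) []byte) []byte {
	c.mu.RLock()
	v, ok := c.items[key]
	c.mu.RUnlock()
	if ok {
		return v
	}
	v = load(key) // e.g. a database read on cache miss
	c.mu.Lock()
	c.items[key] = v
	c.mu.Unlock()
	return v
}

func (c *schemaCache) size() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.items)
}

// shepherd holds both cache implementations while the migration is in flight.
// Whichever one the feature flag selects gets populated on each request; the
// other keeps whatever it already holds, so flipping the flag under load
// leaves both caches resident and memory usage roughly doubles.
type shepherd struct {
	useNewCache bool // feature flag value
	oldCache    *schemaCache
	newCache    *schemaCache
}

func (s *shepherd) schemaFor(dataset string, load func(string) []byte) []byte {
	if s.useNewCache {
		return s.newCache.get(dataset, load)
	}
	// Flag turned off: the old cache starts filling from scratch while the
	// new cache's entries have not been released.
	return s.oldCache.get(dataset, load)
}

func main() {
	load := func(string) []byte { return make([]byte, 1<<10) } // fake schema blob
	s := &shepherd{useNewCache: true, oldCache: newSchemaCache(), newCache: newSchemaCache()}

	for i := 0; i < 1000; i++ { // warm the new cache under normal traffic
		s.schemaFor(fmt.Sprintf("dataset-%d", i), load)
	}
	s.useNewCache = false       // the 20:18 UTC flag flip
	for i := 0; i < 1000; i++ { // the same traffic now re-populates the old cache
		s.schemaFor(fmt.Sprintf("dataset-%d", i), load)
	}
	fmt.Println("entries held:", s.oldCache.size()+s.newCache.size()) // ~2x one cache
}
```

In this shape, turning the flag off frees nothing in the new cache; it only starts re-populating the old cache from scratch, which matches the linear memory growth seen on the heatmap.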
An engineer turned the flag back on within 2 minutes of the incident being declared, but this didn’t solve the main issue: all of the Shepherd pods were in CrashLoopBackOff, and whenever a new pod came up, it received nearly all of the traffic the service normally handles and crashed.
Around the time that was identified, a communications lead was assigned and created the public Statuspage incident, and we started hearing reports of customers having trouble sending data to the Honeycomb API.
Having at least given the service a fix for the crashes, we needed to get it into a state where it was ready to accept traffic again. We took three key measures to get there.
The decision to completely circuit break the service wasn’t taken lightly, but given that a significant portion of traffic was being rejected anyway, giving the cluster a chance to recover was likely the quickest way out. Traffic was blocked entirely for ~8 minutes, and the cluster became operational about 6 minutes after traffic was re-enabled.
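This report doesn’t spell out the mechanism used to block traffic, but as a rough sketch of the idea, one way to circuit break a service fronted by an AWS ALB is to temporarily swap the listener’s default action for a fixed 503 response, then restore forwarding once the cluster is healthy. The sketch below uses the aws-sdk-go-v2 ELBv2 client; the ARNs are placeholders, and none of this is taken from the actual runbook:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
	"github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2/types"
)

// blockTraffic swaps the ALB listener's default action to a static 503 so no
// requests reach the backing pods while the cluster recovers.
func blockTraffic(ctx context.Context, client *elbv2.Client, listenerArn string) error {
	_, err := client.ModifyListener(ctx, &elbv2.ModifyListenerInput{
		ListenerArn: aws.String(listenerArn),
		DefaultActions: []types.Action{{
			Type: types.ActionTypeEnumFixedResponse,
			FixedResponseConfig: &types.FixedResponseActionConfig{
				StatusCode:  aws.String("503"),
				ContentType: aws.String("text/plain"),
				MessageBody: aws.String("ingest temporarily paused"),
			},
		}},
	})
	return err
}

// restoreTraffic points the default action back at the service's target group
// once the pods are healthy again.
func restoreTraffic(ctx context.Context, client *elbv2.Client, listenerArn, targetGroupArn string) error {
	_, err := client.ModifyListener(ctx, &elbv2.ModifyListenerInput{
		ListenerArn: aws.String(listenerArn),
		DefaultActions: []types.Action{{
			Type:           types.ActionTypeEnumForward,
			TargetGroupArn: aws.String(targetGroupArn),
		}},
	})
	return err
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := elbv2.NewFromConfig(cfg)
	// Placeholder ARN; substitute the real listener (and later the real
	// target group for restoreTraffic) before running.
	if err := blockTraffic(ctx, client, "arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/app/EXAMPLE"); err != nil {
		log.Fatal(err)
	}
	// Once the deployment reports healthy, call restoreTraffic to re-enable ingest.
}
```

However the block was implemented in practice, the effect is the same: the cluster gets a window with no inbound load, so every pod can start, warm its cache, and pass health checks before traffic returns.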
All in all, this amounted to around 30 minutes of outage, including 5 minutes of sporadic flaps back to partial service and then 6 minutes of partial service until full recovery.
This incident had many interesting factors - for example, the aforementioned schema caching pattern is extremely new but has already become an indispensable part of the Shepherd service. Turning it on while it was in development on 4/25 caused a 2-minute ingestion outage, but after it was released to production on 4/27 (with more acclimation to allow the cache to warm up), turning it off on 5/2 caused a 30-minute one. Suddenly switching to a different cache implementation while under load effectively doubled the memory demands. This, among other aspects, will be discussed in the full report once the post-incident investigation has been completed.