Ingestion delays

Incident Report for Honeycomb

Postmortem

Preliminary report

On September 8, we suffered almost 9 hours of partial or near-total ingest outage. During this time, most of the data sent to the Honeycomb ingestion endpoints was either lost to errors or severely delayed. This preliminary report attempts to provide a high-level explanation of what happened, pending an in-depth incident review in the upcoming weeks.

Currently, most signs point to the incident having been caused by reaching an implicit scaling limit on our ingestion service, related to its ability to fill its local cache. The local cache contains information such as customers’ datasets, schema, or teams, and is used to prepare data before it makes it to columnar storage. At the time of the incident, the cache lines would fill themselves on demand, the first time a given instance encountered a specific piece of data. Filling in the cache lines is necessary to process information, but it is meant to be a non-blocking operation: one customer’s missing cache entry does not prevent other customers from doing work.
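As a rough illustration of the on-demand behaviour described above, here is a minimal Python sketch. It is not the ingest service's actual code: the `OnDemandCache` class, the `fetch` callback (a stand-in for a database lookup), and the per-key locking scheme are all assumptions made for the example. It only shows the general idea that a miss for one key makes callers for that same key wait, while other keys proceed.

```python
import threading
from typing import Callable, Dict


class OnDemandCache:
    """Hypothetical cache that fills each entry the first time it is requested."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch  # assumed stand-in for a slow lookup (e.g. a database)
        self._entries: Dict[str, dict] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the two dicts, held only briefly

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key: str) -> dict:
        with self._guard:
            if key in self._entries:
                return self._entries[key]
        # Cache miss: only callers asking for this same key wait here.
        with self._lock_for(key):
            with self._guard:
                if key in self._entries:  # another caller filled it meanwhile
                    return self._entries[key]
            value = self._fetch(key)  # slow call made outside the shared guard
            with self._guard:
                self._entries[key] = value
            return value
```

In a design like this, a freshly restarted instance starts with an empty cache and has to pay the fetch cost for every key it sees, which is the situation the rest of this report is concerned with.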

We had signs in the recent past that the current implementation was reaching its limits and corrective work was scheduled, but we apparently hit a yet-unknown tipping point during this incident. This tipping point led to accidental synchronized lock-ups of various ingest hosts, which all shed their cache at once, then stalled each other in re-filling it. This in turn led to bubbling memory, hanging execution of ingest, and failing health checks. These in turn led our schedulers to restart the ingest hosts (flushing their cache in the process), which self-amplified into continuous crash loops.

These errors also overloaded some of our downstream services that handle sampling and provide our own operational data, which meant we in turn lost the observability data we needed to investigate the situation. We still do not know what specific interplay of customer data, volume, scale, locking behaviours, and cache invalidation led to this particular outage. However, given the recent signs and context, we knew it was generally related to the caching behaviour.

Most of the work that day was spent trying various mitigation techniques to re-stabilize the cluster, but most failed. The incident was only resolved when our engineers reworked the caching mechanism so that each host pre-fills its cache with recently fetched data before it starts accepting traffic, a change that was already in our backlog. This development work bypassed the self-amplifying cache synchronization that led to crash loops, and the ingest layer has been stable since then.
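The pre-fill approach can be sketched roughly as follows, again in Python and again under stated assumptions: `PrefilledCache`, `fetch_recent`, `fetch_one`, and `health_check` are hypothetical names, and the real change may differ in every detail. The sketch only illustrates the ordering: load recent entries first, and report ready for traffic only afterwards.

```python
from typing import Callable, Dict, Iterable, Tuple


class PrefilledCache:
    """Hypothetical cache that is warmed before the host accepts traffic."""

    def __init__(
        self,
        fetch_recent: Callable[[], Iterable[Tuple[str, dict]]],
        fetch_one: Callable[[str], dict],
    ) -> None:
        self._fetch_recent = fetch_recent  # assumed bulk lookup of recent entries
        self._fetch_one = fetch_one        # assumed single-key fallback lookup
        self._entries: Dict[str, dict] = {}
        self.ready = False

    def warm_up(self) -> None:
        # One bulk load at startup instead of many misses under live traffic.
        for key, value in self._fetch_recent():
            self._entries[key] = value
        self.ready = True

    def get(self, key: str) -> dict:
        # Entries not seen recently still fall back to an on-demand fill.
        if key not in self._entries:
            self._entries[key] = self._fetch_one(key)
        return self._entries[key]


def health_check(cache: PrefilledCache) -> bool:
    """Report healthy only once the cache has been warmed."""
    return cache.ready
```

With this ordering, a restarted host would not receive traffic while its cache is empty, so a restart no longer triggers a burst of simultaneous cache fills.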

An in-depth incident review will be published in the next few weeks as we keep investigating what happened to draw the most useful lessons for the future.

Complete Report

We have completed our incident review of this outage. The full and abridged reports are available at https://www.honeycomb.io/blog/incident-review-shepherd-cache-delays

Posted Sep 21, 2022 - 08:04 PDT

Resolved

Event ingestion has remained at normal levels; we are considering this incident complete.

Posted Sep 09, 2022 - 12:16 PDT

Monitoring

We have deployed a fix that has stabilized the ingest pipeline and are continuing to monitor for any additional impacts.

Posted Sep 08, 2022 - 21:47 PDT

Update

Work continues on stabilizing the ingest pipeline.

Posted Sep 08, 2022 - 20:46 PDT

Update

We are continuing to investigate and work towards stability.

Posted Sep 08, 2022 - 19:24 PDT

Update

The ingestion pipeline remains in an unstable state, but we are still working towards stability.

Posted Sep 08, 2022 - 17:27 PDT

Update

Ingestion is still seeing occasional spikes in latency and increased error rates. We are working to stabilize the ingest pipeline.

Posted Sep 08, 2022 - 16:35 PDT

Update

Ingestion appears to remain stable. We are monitoring other systems to ensure their reliability and investigating possible causes.

Posted Sep 08, 2022 - 15:57 PDT

Update

We have increased the capacity of the fleet; error rates and delays appear reduced, but not resolved. We are still investigating to ensure continued stability.

Posted Sep 08, 2022 - 15:03 PDT

Update

We are increasing the capacity of our fleet and continuing to mitigate backpressure on our ingestion pipeline. You may continue to see delayed responses and increased HTTP errors.

Posted Sep 08, 2022 - 14:16 PDT

Update

We are continuing to investigate and roll out mitigations in order to reduce internal load and increase availability of the ingestion service. Our post-ingestion services are healthy and being monitored.

Posted Sep 08, 2022 - 13:31 PDT

Update

We are continuing to investigate this issue.

Posted Sep 08, 2022 - 12:59 PDT

Update

We are continuing to investigate the issue. You may see delayed responses from our API and increased 502 errors from our event ingest service.

Posted Sep 08, 2022 - 12:57 PDT

Investigating

We are currently investigating an issue with delayed responses by our event ingest service.

Posted Sep 08, 2022 - 12:21 PDT

This incident affected: api.honeycomb.io - US1 Event Ingest.