On September 8, we’ve suffered almost 9 hours of partial or near-total ingest outage. During this time period, most of the data sent to the Honeycomb ingestion endpoints ended up lost in errors or severely delayed. This preliminary report attempts to provide a high-level explanation of what happened, pending an in-depth incident review in the upcoming weeks.
Currently, most signs point to the incident having been cause by reaching an implicit scaling limit on our ingestion service, related to its ability to fill its local cache. The local cache contains information such as customers’ datasets, schema, or teams, and is used to prepare data before it makes it to columnar storage. At the time of the incident, the cache lines would fill themselves on-demand, the first time a given instance encountered a specific piece of data. Filling in the cache lines is necessary to process information, but is a non-blocking operation, where one customer’s missing cache does not prevent other customers from doing work.
We had signs in the recent past that the current implementation was reaching its limits and corrective work was scheduled, but we apparently hit a yet-unknown tipping point during this incident. This tipping point led to accidental synchronized lock-ups of various ingest hosts, which all shed their cache at once, then stalled each other in re-filling it. This in turn led to bubbling memory, hanging execution of ingest, and failing health checks. These in turn led our schedulers to restart the ingest hosts (flushing their cache in the process), which self-amplified into continuous crash loops.
These errors also overloaded some of our downstream services doing sampling and providing our own operational data, which meant we in turn lost our own observability data to investigate the situation. We still do not know what specific interplay of customer data, volume, scale, locking behaviours, and cache invalidation led to this particular outage. However, knowing the recent signs and context, we knew it was generally related to the caching behaviour.
Most of the work that day was spent trying various mitigation techniques to re-stabilize the cluster, but most failed. The incident only got resolved when our engineers reworked the caching mechanism to pre-fill itself with recent data fetched before it would start accepting traffic, which had been added to our backlog already. This development work bypassed the self-amplifying cache synchronization that led to crash-loops, and the ingest layer has been stable since then.
An in-depth incident review will be published in the next few weeks as we keep investigating what happened to draw the most useful lessons for the future.
We have completed our incident review of this outage. The full and abridged reports are available at https://www.honeycomb.io/blog/incident-review-shepherd-cache-delays