On May 2nd from 4:36 p.m. to 5:00 p.m. PDT, the ingest service for Honeycomb’s US region had an incident lasting 24 minutes. For the first 17 minutes, the service accepted only a reduced, fluctuating share of incoming traffic (roughly 30%), and for the last seven minutes, all incoming traffic was dropped. During this time, customers sending us telemetry (either directly from within their applications or via a proxy such as Refinery or the OpenTelemetry Collector) would have seen slow or failed attempts to send telemetry to api.honeycomb.io. Querying via the UI remained mostly functional, though for some of that time responsiveness slowed and some queries failed.
This incident occurred during routine maintenance on one of our caching servers. To sustain the volume of traffic we receive from our customers, we rely on several layers of caches that store frequently accessed information: each process has a local in-memory cache, there is a shared remote cache, and behind both sits the database itself. The local and shared caches both expire entries as they age to manage memory use. Because these caches are layered, either one can be emptied for a short time and the system will continue to function. However, if one of the caches is unavailable for too long, the load shifts to the database.
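As a rough illustration of that read path (not Honeycomb’s actual code; the Store interface, Lookup method, and TTL values below are hypothetical), a layered read-through lookup along these lines keeps most reads off the database:

```go
package cache

import (
	"context"
	"time"
)

// Store is a minimal get/set interface; both the local in-memory cache
// and the shared remote cache can satisfy it.
type Store interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Set(ctx context.Context, key string, val []byte, ttl time.Duration)
}

// DB stands in for the authoritative database lookup.
type DB interface {
	Fetch(ctx context.Context, key string) ([]byte, error)
}

type Layered struct {
	local  Store // per-process in-memory cache, short TTL
	shared Store // shared remote cache, longer TTL
	db     DB
}

// Lookup checks the local cache, then the shared cache, and only falls
// through to the database on a miss in both. Layers that missed are
// backfilled on the way out, so repeated reads stop reaching the database.
func (l *Layered) Lookup(ctx context.Context, key string) ([]byte, error) {
	if v, ok := l.local.Get(ctx, key); ok {
		return v, nil
	}
	if v, ok := l.shared.Get(ctx, key); ok {
		l.local.Set(ctx, key, v, 30*time.Second) // hypothetical local TTL
		return v, nil
	}
	// Both caches missed: this is the load that shifts to the database
	// when the shared cache is empty for too long.
	v, err := l.db.Fetch(ctx, key)
	if err != nil {
		return nil, err
	}
	l.shared.Set(ctx, key, v, 5*time.Minute) // hypothetical shared TTL
	l.local.Set(ctx, key, v, 30*time.Second)
	return v, nil
}
```

The important property is the fall-through: every read that misses both caches becomes a database query, which is exactly the load that piles up when the shared cache stays empty.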
During this maintenance, which adjusted the configuration of the shared cache (a change intended to improve the experience of our largest customers), the shared cache was unavailable for too long, and as load shifted to the database, the database became overwhelmed. The shared cache must be filled from the database, so once the database was overwhelmed, the cache could not be refilled. This was a reinforcing feedback loop: the more load the database carried, the more it needed the cache, and the harder the cache was to fill. At some point the whole system tipped over, and the only way to recover was to block traffic entirely so the cache could be refilled.
These stages of degradation correspond to the two main phases of the incident. Of the 24 minutes our system was impacted, the first 17 were an escalating struggle to refill the cache as the database became more and more overloaded. The last seven minutes were when we shut off all incoming traffic so the database could recover and the cache could fill. As soon as the cache was full, we allowed traffic back into the system.
This chart shows some of the interactions described above. Adding the remote cache removes potential database load and allows the system to scale above what would otherwise be the limit of the database (labeled Safety Limit). When the remote cache clears, load on the database gradually increases as cached entries expire. However, there is a window between the moment the cache clears and the moment that growing load hits the safety limit, and within that window the system still functions! If the process to refill the cache can finish within this window, the system stays up. If it cannot, then once the blue database line hits the red safety limit line, it becomes impossible to recover the system without taking it offline. So long as this window remains large enough, there are benefits to keeping the caching architecture simple. But when the window becomes too small, there are a few other paths forward.
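To put rough numbers on that window (every figure below is a hypothetical placeholder, not a measurement from this incident), the picture in the chart amounts to dividing the database’s remaining headroom by how quickly load grows once the shared cache clears:

```go
package main

import "fmt"

// All figures are illustrative only; real values depend on cache TTLs
// and the traffic the system is serving at the time.
func main() {
	const (
		steadyLoad  = 2000.0 // DB queries/sec with a warm shared cache
		safetyLimit = 9000.0 // queries/sec beyond which the DB can't keep up
		growthRate  = 500.0  // extra queries/sec added each minute as entries expire
	)

	headroom := safetyLimit - steadyLoad
	windowMinutes := headroom / growthRate

	fmt.Printf("recovery window ≈ %.0f minutes\n", windowMinutes)
	// If refilling the shared cache takes longer than this window, the
	// reinforcing loop takes over and the only way out is shedding traffic.
}
```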
We can use this chart to describe changes to the system that would make a repeat of this incident less likely. There are two levers: we can make the window larger, and we can reduce the chance we enter the window at all.
In summary, caches let systems scale to great heights, but they add operational complexity and make the overall system harder to understand. Adding them in the right place opens a system to new opportunities, while at the same time making previously simple behaviors more chaotic and difficult to reason about.
For this particular system impacting Honeycomb ingestion, we are both adding failover to the cache and adjusting our cache timeouts, to ensure that we enter a window like this one less often, and that when we do, we have more time available to complete the needed maintenance.
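As a sketch of what those two changes can look like (hypothetical names and values again, reusing the Store interface from the earlier sketch rather than anything from our production code), a longer, jittered TTL slows how quickly load returns to the database after the cache clears, while a failover wrapper keeps the shared layer populated even when one cache node is under maintenance:

```go
package cache

import (
	"context"
	"math/rand"
	"time"
)

// jitteredTTL spreads expirations out so entries don't all fall back to
// the database at once; a longer base TTL also slows post-clear load
// growth, which widens the recovery window. Values are hypothetical.
func jitteredTTL(base time.Duration) time.Duration {
	if base <= 0 {
		return base
	}
	jitter := time.Duration(rand.Int63n(int64(base/4) + 1))
	return base + jitter
}

// Failover tries a primary shared cache and falls back to a secondary,
// so maintenance on one node doesn't empty the shared layer entirely.
type Failover struct {
	primary, secondary Store
}

func (f *Failover) Get(ctx context.Context, key string) ([]byte, bool) {
	if v, ok := f.primary.Get(ctx, key); ok {
		return v, true
	}
	return f.secondary.Get(ctx, key)
}

func (f *Failover) Set(ctx context.Context, key string, val []byte, ttl time.Duration) {
	f.primary.Set(ctx, key, val, ttl)
	f.secondary.Set(ctx, key, val, ttl)
}
```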