On May 2nd from 4:36 p.m. to 5:00 p.m. PDT, the ingest service for Honeycomb’s US region had an incident lasting 24 minutes. For the first 17 minutes, the service accepted only a reduced, fluctuating share of incoming traffic (roughly 30%), and for the last seven minutes, all incoming traffic was dropped. During this time, customers sending us telemetry (either directly from within their applications or via a proxy such as Refinery or the OpenTelemetry Collector) would have seen slow or failed attempts to send telemetry to api.honeycomb.io. Querying via the UI remained mostly functional, though for some of that time responsiveness slowed and some queries failed.
This incident occurred during routine maintenance on one of our caching servers. To sustain the volume of traffic we receive from our customers, we rely on several layers of caches that store frequently accessed information: each process has a local in-memory cache, there is a shared remote cache, and behind both sits the database itself. The local and shared caches both expire entries as they age to manage memory use. Because these caches are layered, either one can be emptied for a short time and the system will continue to function. However, if one of the caches is unavailable for too long, the load shifts to the database.
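As a rough illustration of that read path (not Honeycomb’s actual code; the Store interface, Lookup method, and TTL values below are hypothetical), a layered read-through lookup along these lines keeps most reads off the database:

```go
package cache

import (
	"context"
	"time"
)

// Store is a minimal get/set interface; both the local in-memory cache
// and the shared remote cache can satisfy it.
type Store interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Set(ctx context.Context, key string, val []byte, ttl time.Duration)
}

// DB stands in for the authoritative database lookup.
type DB interface {
	Fetch(ctx context.Context, key string) ([]byte, error)
}

type Layered struct {
	local  Store // per-process in-memory cache, short TTL
	shared Store // shared remote cache, longer TTL
	db     DB
}

// Lookup checks the local cache, then the shared cache, and only falls
// through to the database on a miss in both. Layers that missed are
// backfilled on the way out, so repeated reads stop reaching the database.
func (l *Layered) Lookup(ctx context.Context, key string) ([]byte, error) {
	if v, ok := l.local.Get(ctx, key); ok {
		return v, nil
	}
	if v, ok := l.shared.Get(ctx, key); ok {
		l.local.Set(ctx, key, v, 30*time.Second) // hypothetical local TTL
		return v, nil
	}
	// Both caches missed: this is the load that shifts to the database
	// when the shared cache is empty for too long.
	v, err := l.db.Fetch(ctx, key)
	if err != nil {
		return nil, err
	}
	l.shared.Set(ctx, key, v, 5*time.Minute) // hypothetical shared TTL
	l.local.Set(ctx, key, v, 30*time.Second)
	return v, nil
}
```

The important property is the fall-through: every read that misses both caches becomes a database query, which is exactly the load that piles up when the shared cache stays empty.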
During this maintenance, which adjusted the configuration of the shared cache (a change intended to improve the experience of our largest customers), the shared cache was unavailable for too long, and as load shifted to the database, the database became overwhelmed. The shared cache must be filled from the database, so once the database was overwhelmed, the cache could not be refilled. This was a reinforcing feedback loop: the more load the database carried, the more it needed the cache, and the harder the cache was to fill. At some point the whole system tipped over, and the only way to recover was to block traffic entirely so the cache could be refilled.
These stages of degradation correspond to the two main phases of the incident. Of the 24 minutes our system was impacted, the first 17 were an escalating struggle to refill the cache as the database became more and more overloaded. The last seven minutes were when we shut off all incoming traffic so the database could recover and the cache could fill. As soon as the cache was full, we allowed traffic back into the system.
This chart shows some of the interactions described above. Adding the remote cache removes potential database load and allows the system to scale above what would otherwise be the limit of the database (labeled Safety Limit). When the remote cache clears, load on the database gradually increases as cached entries expire. However, there is a window between the moment the cache clears and the moment that growing load hits the safety limit, and within that window the system still functions! If the process to refill the cache can finish within this window, the system stays up. If it cannot, then once the blue database line hits the red safety limit line, it becomes impossible to recover the system without taking it offline. So long as this window remains large enough, there are benefits to keeping the caching architecture simple. But when the window becomes too small, there are a few other paths forward.
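To put rough numbers on that window (every figure below is a hypothetical placeholder, not a measurement from this incident), the picture in the chart amounts to dividing the database’s remaining headroom by how quickly load grows once the shared cache clears:

```go
package main

import "fmt"

// All figures are illustrative only; real values depend on cache TTLs
// and the traffic the system is serving at the time.
func main() {
	const (
		steadyLoad  = 2000.0 // DB queries/sec with a warm shared cache
		safetyLimit = 9000.0 // queries/sec beyond which the DB can't keep up
		growthRate  = 500.0  // extra queries/sec added each minute as entries expire
	)

	headroom := safetyLimit - steadyLoad
	windowMinutes := headroom / growthRate

	fmt.Printf("recovery window ≈ %.0f minutes\n", windowMinutes)
	// If refilling the shared cache takes longer than this window, the
	// reinforcing loop takes over and the only way out is shedding traffic.
}
```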
We can use this chart to describe changes to the system that would make a repeat of this incident less likely. There are two levers: we can make the window larger, and we can reduce the chance we enter the window at all.
In summary, caches let systems scale to great heights, but they add operational complexity and make the overall system harder to understand. Adding them in the right place opens a system to new opportunities, while at the same time making previously simple behaviors more chaotic and difficult to reason about.
For this particular system impacting Honeycomb ingestion, we are both adding failover to the cache and adjusting our cache timeouts, to ensure that we enter a window like this one less often, and that when we do, we have more time available to complete the needed maintenance.
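As a sketch of what those two changes can look like (hypothetical names and values again, reusing the Store interface from the earlier sketch rather than anything from our production code), a longer, jittered TTL slows how quickly load returns to the database after the cache clears, while a failover wrapper keeps the shared layer populated even when one cache node is under maintenance:

```go
package cache

import (
	"context"
	"math/rand"
	"time"
)

// jitteredTTL spreads expirations out so entries don't all fall back to
// the database at once; a longer base TTL also slows post-clear load
// growth, which widens the recovery window. Values are hypothetical.
func jitteredTTL(base time.Duration) time.Duration {
	if base <= 0 {
		return base
	}
	jitter := time.Duration(rand.Int63n(int64(base/4) + 1))
	return base + jitter
}

// Failover tries a primary shared cache and falls back to a secondary,
// so maintenance on one node doesn't empty the shared layer entirely.
type Failover struct {
	primary, secondary Store
}

func (f *Failover) Get(ctx context.Context, key string) ([]byte, bool) {
	if v, ok := f.primary.Get(ctx, key); ok {
		return v, true
	}
	return f.secondary.Get(ctx, key)
}

func (f *Failover) Set(ctx context.Context, key string, val []byte, ttl time.Duration) {
	f.primary.Set(ctx, key, val, ttl)
	f.secondary.Set(ctx, key, val, ttl)
}
```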