On July 25, 2023, we experienced a total Honeycomb outage spanning ingest, querying, triggers, SLOs, and our API. Following 10 minutes of partially degraded ingestion, we saw a rapid failure cascade that took down most of our services. The outage impacted all user-facing components from 13:40 UTC to 14:48 UTC, during which time no data could be processed or accessed, and everything but the public Honeycomb website was inaccessible. The user interface recovered around 14:48 UTC, and querying slowly started recovering, unevenly across partitions and teams. During this time, many requests either contained outdated information or failed outright. Ingestion came back around 15:15 UTC, at which point we could accept incoming traffic and API requests again. Query capacity, along with trigger and SLO alerting, kept improving in accuracy and success rate for more and more users, until service returned to normal at 15:35 UTC.
This gives us roughly 1 hour and 10 minutes of total outage, with an extra 28 minutes of ingest outage (10 minutes partial, 18 minutes total), and 47 minutes of degraded querying and alerting (both triggers and SLOs), for a bit more than 2 hours of overall incident time.
As part of ongoing infrastructure changes, we were alternating between minor variations of our storage engine and shifting query traffic across them. Part of the functionality we shifted across versions included updating the time at which we last ingested data for each dataset and field. That’s a switch we had done multiple times in the recent past, but unbeknownst to us, it had partially failed at the end of the day on July 24. This field not being updated, in turn, slowly de-populated the cache we use for the schemas of all users’ datasets, and the ingestion endpoint started hitting the storage backend more aggressively. (Note: the cache is also a somewhat recent addition to our system, so we’re still learning about its operational characteristics now that it’s in production.) Another side effect of this field not being updated was that the “Last Data Received” fields in the schema user interface were inaccurate.
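To make the failure mode concrete, here is a minimal sketch of the dynamic at play. This is not our actual code; the class, the TTL mechanism, and all names are illustrative. The idea is a read-through cache whose entries stay fresh only while the last-ingest timestamp keeps being updated; once those updates stop, entries quietly lapse and every lookup falls through to the storage backend.

```python
class SchemaCache:
    """Toy model of a read-through schema cache. Entries are fresh only
    while the dataset's last-ingest timestamp keeps advancing."""

    def __init__(self, backend, ttl_seconds=60.0):
        self.backend = backend        # stand-in for the storage engine/database
        self.ttl = ttl_seconds
        self.entries = {}             # dataset -> (schema, last_refreshed_at)
        self.backend_hits = 0

    def touch(self, dataset, now):
        """Called when the last-ingest time is successfully updated.
        This is the step that silently stopped happening on July 24."""
        if dataset in self.entries:
            schema, _ = self.entries[dataset]
            self.entries[dataset] = (schema, now)

    def get(self, dataset, now):
        hit = self.entries.get(dataset)
        if hit is not None:
            schema, refreshed_at = hit
            if now - refreshed_at < self.ttl:
                return schema
        # Stale or missing entry: fall through to the storage backend.
        self.backend_hits += 1
        schema = self.backend(dataset)
        self.entries[dataset] = (schema, now)
        return schema
```

In this simplified model, as long as `touch` keeps firing, lookups never reach the backend; when it stops, each entry expires after one TTL and backend traffic climbs with every miss.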
At that point, however, we had only noticed minor performance impact on ingest, and were unaware of the cache emptying itself. Since it was getting late at night and our traffic naturally dropped, the problem appeared minor and remediation was scheduled for the next morning. On July 25, a few hours before the incident, we identified the problem and found a solution. As we were about to roll the storage engine hosts to bring the cache data back, morning traffic ramped up. The resulting scale-up of ingest hosts increased the load on our database, with more hosts fetching un-cached data. At roughly the same time, some users started sending data that required schema changes, and this pushed our database over the edge. It hard-locked and performance went to zero, which we currently suspect was due to a database-internal deadlock. Connection rates climbed across the cluster until services would hang, fail to connect, or time out, crash as their backlog grew faster than they could process it, and then fail to come back up.
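The amplification here can be approximated with back-of-the-envelope arithmetic (every number below is made up purely for illustration, and the model ignores connection pooling and query batching): database load scales with the number of ingest hosts times the number of datasets whose schemas are not cached, so a drained cache and a morning autoscaling event multiply together.

```python
def db_lookups_per_interval(hosts, datasets, cache_hit_ratio):
    """Crude load model: each (host, dataset) pair that misses its local
    cache costs roughly one database query per refresh interval."""
    return round(hosts * datasets * (1 - cache_hit_ratio))

# Overnight: few hosts, warm cache -> the problem looks minor.
overnight = db_lookups_per_interval(hosts=10, datasets=1000, cache_hit_ratio=0.99)

# Morning: autoscaling adds hosts just as the cache has drained.
morning = db_lookups_per_interval(hosts=40, datasets=1000, cache_hit_ratio=0.20)
```

With these invented numbers, the morning load is hundreds of times the overnight load, which is the shape of the cliff we fell off, even if the real magnitudes differed.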
Since we knew there was a direct relationship between ingest load and the servers crashing, and that we needed to bring the storage engine back up so the front-end ingest could re-hydrate the cache and keep it steady, we decided to cut off all ingest traffic and return static 500 errors. In parallel, we started investigating the database, attempting to un-wedge it so it could come back active and warmed up; but since the issue was deep in its internals and administrative commands wouldn’t work, we had to fail over to a hot standby in another availability zone. This allowed services to come back up, but until we could confirm the cache data would be functional, we kept ingest traffic offline, since traffic without a cache would just re-create the same outage.
We ended up back-filling the cache, confirming it worked, and then letting general ingest traffic back in through multiple short bursts, so we could see that the cache could sustain itself and that the database was warm and able to take the expected traffic. We gradually increased the traffic bursts until everything was back to normal, then shifted to gradually correcting the storage engine partitions that were not properly answering queries, and bringing trigger and SLO functionality back online as the system recovered.
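The burst-based ramp-up can be sketched as a staged admission gate. This is a deliberate simplification with invented names: in reality, operators watched cache and database health between steps rather than calling a function, but the shape is the same, with each stage admitting a larger fraction of incoming traffic and advancement happening only once the previous burst looked healthy.

```python
import random

def make_ramp_gate(stages, rng=random.random):
    """Staged admission control: admit() lets through the current stage's
    fraction of requests; advance() is invoked (by a human, here) once the
    system looks healthy at the current level."""
    state = {"stage": 0}

    def admit():
        return rng() < stages[state["stage"]]

    def advance():
        state["stage"] = min(state["stage"] + 1, len(stages) - 1)

    return admit, advance
```

A gate built with `make_ramp_gate([0.0, 0.25, 1.0])` starts by shedding everything, then admits roughly a quarter of requests, then all of them, mirroring the cut-off, burst, and full-recovery phases described above.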
At this point in time, we have not yet identified everything we could do to improve the situation, but we already have plenty of ideas to sort through and prioritize. We are planning to write a more in-depth review and report in the next few weeks, which should look at both the response and the underlying mechanisms in depth. We value our customers; we recognize and apologize for the impact this incident had on you. (We got messages of #HugOps from some of our users mid-incident—never expected, but always appreciated by the team working to restore normal operations.)