At 07:05 UTC on October 22, 2025, Honeycomb Platform On-Call received a page indicating that our end-to-end tests, which run every five minutes, had failed. The responding engineer logged in to find multiple systemic failures affecting querying and SLO evaluation, and was also unable to log in to the AWS console.
Six minutes later, at 07:11 UTC, AWS declared an outage affecting us-east-1. The outage would not be declared resolved until 22:53 UTC, nearly sixteen hours later. Even after AWS indicated resolution, services continued to be impacted by throttling imposed to permit system recovery.
Throughout the incident, we remained able to ingest events; no events received by our ingest hosts were dropped due to system failures. Querying was affected between approximately 13:30 and 14:06, 15:00 and 16:00, 16:10 and 16:45, 17:12 and 17:21, and 18:40 and 19:06 UTC, which also delayed evaluation of Triggers and SLOs during those periods.
AWS described the trigger of the outage as failed DNS resolution for internal DynamoDB endpoints. Once this DNS resolution issue was handled, issues affecting dependent systems within AWS became visible. The most notable symptoms for Honeycomb were errors while launching instances and failed Lambda invocations. The Lambda failures made it harder to see exactly which systems were impacted, since under normal operations Honeycomb relies on Lambda to power query functionality. EC2 scheduling issues reduced our capacity to keep up with SLO evaluation; however, we were able to hold on to the EC2 capacity we already had, which kept ingest functioning. We also saw degraded performance on ancillary systems such as Service Maps due to reduced overall capacity.
Multiple providers we rely on were also significantly affected by the AWS outage, limiting our ability to make on-the-fly changes to the running system. Ordinarily, if particular features are inaccessible or otherwise in a problematic state, we have levers to deactivate them and protect the stability of the key parts of Honeycomb. We were able to mitigate some aspects of the failure without those levers, but several of them were unavailable to us in this case - for example, one that would have allowed querying to continue in a limited capacity while Lambda was inaccessible.
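For readers unfamiliar with what we mean by a lever, the sketch below shows one shape such a kill switch can take: a feature flag consulted before query work is dispatched to Lambda, with a reduced-capacity local path as the fallback. This is a minimal illustration under assumed names, not our actual implementation; the flag name, the types, and the FlagStore interface are all hypothetical.

```go
// Package queryfallback sketches an operational "lever": a kill switch that
// routes query execution away from Lambda when it is unavailable. All names
// here are hypothetical.
package queryfallback

import (
	"context"
	"errors"
)

// Query and Result stand in for the real query and result types.
type Query struct{ Expression string }
type Result struct{ Rows int }

// FlagStore abstracts whichever feature-flag backend holds operational levers.
type FlagStore interface {
	IsEnabled(ctx context.Context, flag string) bool
}

// ErrLambdaUnavailable is returned by the Lambda-backed path when it cannot run.
var ErrLambdaUnavailable = errors.New("lambda-backed execution unavailable")

// Executor runs queries via Lambda under normal conditions, with a
// reduced-capacity local path available as a fallback.
type Executor struct {
	Flags      FlagStore
	RunLambda  func(ctx context.Context, q Query) (Result, error)
	RunLocally func(ctx context.Context, q Query) (Result, error)
}

// Run consults the kill switch first; if the lever is flipped, or the Lambda
// path reports that it is unavailable, the query degrades to the limited local
// path instead of failing outright.
func (e *Executor) Run(ctx context.Context, q Query) (Result, error) {
	if e.Flags.IsEnabled(ctx, "disable-lambda-query") {
		return e.RunLocally(ctx, q)
	}
	res, err := e.RunLambda(ctx, q)
	if errors.Is(err, ErrLambdaUnavailable) {
		return e.RunLocally(ctx, q)
	}
	return res, err
}
```

The value of a lever like this is that flipping a flag mid-incident is far cheaper and safer than shipping code; in this case, the outage also degraded some of the external systems we would normally use to flip them.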
Because of the duration and severity of the incident, we plan to conduct a full incident review, with further discussions in which we will re-evaluate our Disaster Recovery story, our reliance on external vendors, and our regional dependence.