Investigating Querying Issues

Incident Report for Honeycomb

Postmortem

At 07:05 UTC on October 20, 2025, Honeycomb Platform On-Call received a page indicating that our end-to-end tests, which run every 5 minutes, had failed. The responding engineer logged in to find multiple systemic failures affecting querying and SLO evaluation, and found that they were unable to log in to the AWS console.

Six minutes later, at 07:11 UTC, AWS declared an outage affecting us-east-1. The outage would not be declared resolved until 22:53 UTC, nearly sixteen hours later, and even after AWS indicated resolution, services continued to be impacted by throttling imposed to permit system recovery.

Throughout the incident we remained able to ingest events; no events received by our ingest hosts were dropped due to system failures. Querying was affected between approximately 13:30 and 14:06 UTC, 15:00 and 16:00 UTC, 16:10 and 16:45 UTC, 17:12 and 17:21 UTC, and 18:40 and 19:06 UTC, which also delayed evaluations of Triggers and SLOs during those periods.

AWS described the trigger of the outage as a DNS resolution failure for internal DynamoDB endpoints. Once this DNS issue was handled, issues affecting dependent systems within AWS became visible. The most notable symptoms for Honeycomb were errors while launching EC2 instances and failed Lambda invocations. The Lambda failures made it harder to see exactly which systems were impacted, since under normal operations Honeycomb relies on Lambda to power query functionality. The EC2 scheduling issues reduced our capacity to keep up with SLO evaluation; however, we were able to maintain the EC2 capacity we already had in order to ensure ingest continued to function. We also saw degraded performance on ancillary systems like Service Maps due to reduced overall capacity.
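To make that dependency concrete, here is a minimal sketch, in Go with the AWS SDK for Go v2, of what a Lambda-backed query call generally looks like and how an invocation failure surfaces as a query error. The function name, payload shape, and error handling are hypothetical illustrations and do not reflect our internal implementation.

// Minimal sketch of a Lambda-backed query call (hypothetical names).
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/lambda"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := lambda.NewFromConfig(cfg)

	// Invoke a hypothetical query-executor function. During the outage,
	// calls like this failed or timed out, which surfaced as query errors.
	out, err := client.Invoke(ctx, &lambda.InvokeInput{
		FunctionName: aws.String("query-executor"), // hypothetical name
		Payload:      []byte(`{"dataset":"example","time_range_sec":3600}`),
	})
	if err != nil {
		log.Printf("lambda invocation failed, query path degraded: %v", err)
		return
	}
	if out.FunctionError != nil {
		log.Printf("query function returned an error: %s", *out.FunctionError)
		return
	}
	fmt.Printf("query result payload: %s\n", out.Payload)
}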

Multiple providers we rely on also experienced significant difficulties during the AWS outage, limiting our ability to make on-the-fly changes to the running system. Ordinarily, if particular features are inaccessible or otherwise in a problematic state, we have levers to de-activate those features and ensure the stability of the key parts of Honeycomb. Several of these levers were unavailable to us in this case, for example one that would have allowed querying to continue in a limited capacity while Lambda was inaccessible, though we were still able to mitigate some aspects of the failure without them.
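As an illustration of what such a lever can look like in practice, the Go sketch below shows a kill switch consulted before an optional code path, with a conservative fallback when the flag provider itself is unreachable. The flag client interface, flag key, and defaults are hypothetical and are not drawn from our codebase.

// Minimal sketch of a feature "lever" (kill switch) with a safe fallback.
package flags

import (
	"context"
	"time"
)

// FlagClient stands in for whatever feature-flag provider is in use.
type FlagClient interface {
	BoolFlag(ctx context.Context, key string, defaultValue bool) (bool, error)
}

// LambdaQueryingEnabled reports whether the Lambda-backed query path should be
// used. If the flag provider cannot be reached (for example, because it is
// affected by the same regional outage), it falls back to the default rather
// than flipping behavior mid-incident.
func LambdaQueryingEnabled(ctx context.Context, c FlagClient) bool {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()

	enabled, err := c.BoolFlag(ctx, "querying.use-lambda", true) // hypothetical key
	if err != nil {
		// Provider unreachable: keep current behavior.
		return true
	}
	return enabled
}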

Because of the duration and severity of the incident, we plan to conduct a full incident review, with further discussions in which we will re-evaluate our Disaster Recovery story, our reliance on external vendors, and our regional dependence.

Posted Oct 22, 2025 - 16:03 PDT

Resolved

All affected systems have fully recovered.
Posted Oct 20, 2025 - 03:04 PDT

Monitoring

We are beginning to see recovery across our systems. We will continue to monitor the situation in case of any regressions.
Posted Oct 20, 2025 - 02:49 PDT

Update

We are currently experiencing degraded performance. Due to an ongoing issue with AWS, some queries over historical data may fail to complete successfully or take longer than usual. This includes trace queries. SLO and Trigger alerts may be missed or delayed. Queries may return results, but those results may not be saved, meaning that loading queries that other people have run may fail. After the problem resolves, queries may need to be rerun.

Our team is closely monitoring the situation. We’ll provide updates here as more information becomes available.
Posted Oct 20, 2025 - 01:46 PDT

Update

We are currently experiencing degraded performance for queries. Due to an ongoing issue with AWS, some queries over historical data may fail to complete successfully or take longer than usual. This includes queries for SLOs, Triggers, and Traces.

Our team is closely monitoring the situation. We’ll provide updates here as more information becomes available.
Posted Oct 20, 2025 - 01:14 PDT

Identified

We are currently experiencing degraded performance for queries that access older data stored in cold storage. Due to an ongoing issue with AWS, some queries over historical data may fail to complete successfully or take longer than usual. This includes queries for SLOs and Triggers.

Queries over recent or “hot” data are not impacted.

Our team is closely monitoring the situation. We’ll provide updates here as more information becomes available.
Posted Oct 20, 2025 - 00:24 PDT
This incident affected: ui.honeycomb.io - US1 Querying and ui.honeycomb.io - US1 Trigger & SLO Alerting.