Honeycomb UI down

Incident Report for Honeycomb

Postmortem

On June 13, at 18:35 UTC, we experienced about 10 minutes of instability followed by 1m30s of complete downtime in our front-end application in the US region. During the initial instability period, interactions with the Honeycomb web application may have been slow or required retries. During the complete downtime, they failed outright. Other components, such as API queries, alerting, and telemetry ingestion, were unaffected.

While the initial instability period was visible in our observability data, the increase in errors was too small to trip our automated alerting: the aggregated error rate went from ~0.2% to ~0.3% of all requests failing and remained under our alerting thresholds. Front-end hosts were running out of memory and restarting one at a time, but recovering fast enough to keep the overall platform functional. We received alerts about these restarts at roughly the same time as all remaining instances went down at once, at 18:45 UTC. At that point all front-end requests failed and more alerts were sent to our engineers.
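
As a rough illustration of why a static threshold can miss this kind of degradation, here is a minimal Python sketch. The 1% threshold, the request counts, and the function names are assumptions made up for the example; they are not Honeycomb's actual alerting configuration. The point is only that a move from ~0.2% to ~0.3% is a 50% relative increase that still sits well under an absolute threshold.

```python
# Hypothetical value for illustration only; the real alerting
# threshold is not stated in this report.
ALERT_THRESHOLD = 0.01  # 1% of requests failing


def exceeds_threshold(failed: int, total: int,
                      threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True when the aggregated error rate is above a static threshold."""
    if total == 0:
        return False
    return failed / total > threshold


def relative_increase(baseline_rate: float, current_rate: float) -> float:
    """Return the fractional increase of current_rate over baseline_rate."""
    return (current_rate - baseline_rate) / baseline_rate


# Made-up request counts that reproduce the ~0.2% and ~0.3% rates.
before = exceeds_threshold(failed=200, total=100_000)  # ~0.2% error rate
during = exceeds_threshold(failed=300, total=100_000)  # ~0.3% error rate
change = relative_increase(0.002, 0.003)

print(before, during, round(change, 2))
```

Under these assumed numbers neither rate crosses the static threshold, even though the error rate has grown by half relative to its baseline.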

We quickly narrowed the problem down to a feature flag we had turned on at about 18:25 UTC. We turned the flag off and let the system stabilize. The flag in question was gating a set of internal modifications to how Honeycomb query schema translations are done on the front-end service before being passed to our back-end systems.

While monitoring system recovery, we figured out that the new implementation required more memory than we had anticipated, because it held on to that memory for too long in the query lifecycle. The cost of translating while keeping that memory longer than necessary, coupled with specific (but normal) query patterns, could tip our front-end hosts over their memory limit faster than they could scale up or recover.
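
The effect of holding memory longer can be approximated with Little's law (average items in flight = arrival rate × time each item is held). The Python sketch below uses that approximation. Every number and name in it is an assumption chosen for illustration, not a measurement from this incident or a description of the actual translation code.

```python
def in_flight_memory_mb(queries_per_second: float,
                        hold_seconds: float,
                        mb_per_query: float) -> float:
    """Estimate average memory held by in-flight queries via Little's law.

    Average in-flight queries = arrival rate * time the memory is held,
    so the memory held scales linearly with the hold time.
    """
    return queries_per_second * hold_seconds * mb_per_query


# All values below are hypothetical.
HOST_LIMIT_MB = 512.0

short_hold = in_flight_memory_mb(queries_per_second=50, hold_seconds=0.2, mb_per_query=4)
long_hold = in_flight_memory_mb(queries_per_second=50, hold_seconds=3.0, mb_per_query=4)

print(short_hold, short_hold > HOST_LIMIT_MB)
print(long_hold, long_hold > HOST_LIMIT_MB)
```

With these assumed inputs, the same traffic and the same per-query cost stay well under the hypothetical limit when memory is released early, and exceed it when the memory is held for most of the query lifecycle.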

Our plans at this point are to review the translation code and to revisit the memory and scaling allocations made for the front-end component. We do not plan on posting a more in-depth review at this time.

Posted Jun 17, 2025 - 06:16 PDT

Resolved

This incident has been resolved.
Posted Jun 13, 2025 - 12:31 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 13, 2025 - 11:53 PDT

Identified

We have identified the feature flag that triggered this and are rolling it back.
Posted Jun 13, 2025 - 11:51 PDT
This incident affected: ui.honeycomb.io - US1 Querying and ui.honeycomb.io - US1 App Interface.