Querying errors

Incident Report for Honeycomb

Postmortem

Honeycomb queries and triggers were impaired from 19:00 to 20:28 UTC on June 13, 2023 due to an outage in our upstream service provider. The following is an overview of what happened during this incident.

During this incident, certain queries (described below) failed, usually after a long delay. This applied to queries on https://ui.honeycomb.io/, Triggers, and queries sent through the Query Data API. Data ingestion remained unaffected, and no user data was lost during this time.

Our data storage has two tiers, “hot” and “cold.” Queries against cold storage are performed using AWS Lambda (more detail in this article). Degradation of the Lambda service meant that any query against cold data failed. Queries against hot data were unaffected.

We transition data from hot to cold storage within 24 hours after we receive it. However, the exact time of this transition depends on the rate at which any given customer sends us data. For that reason, it is not possible to give a specific age cutoff at which queries began to fail during this incident.
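As a rough illustration of why the cutoff varies, the sketch below checks whether a query's time range would reach past an assumed hot-storage window, and shortens the range if it does. It is not Honeycomb's implementation. The function names and the `HOT_WINDOW_ESTIMATE` value are hypothetical; the 2-hour figure borrows the guidance for higher-traffic datasets given in the status updates for this incident, and the real boundary for any dataset is not published.

```python
from datetime import datetime, timedelta, timezone


def touches_cold_storage(query_start, now, hot_window):
    """Return True if the query reaches back past the assumed hot window."""
    return query_start < now - hot_window


def clamp_to_hot_window(query_start, now, hot_window):
    """Move the query start forward so the whole range stays in hot data."""
    return max(query_start, now - hot_window)


# Hypothetical estimate for a higher-traffic dataset, not an official figure.
HOT_WINDOW_ESTIMATE = timedelta(hours=2)

# A query issued during the incident that asks for the last 6 hours.
now = datetime(2023, 6, 13, 19, 30, tzinfo=timezone.utc)
requested_start = now - timedelta(hours=6)

# Under the assumed window, this query would need cold storage (and Lambda).
needs_lambda = touches_cold_storage(requested_start, now, HOT_WINDOW_ESTIMATE)

# Shortening the range to the assumed hot window avoids the cold tier.
safe_start = clamp_to_hot_window(requested_start, now, HOT_WINDOW_ESTIMATE)
```

With these assumed values, `needs_lambda` is `True` and `safe_start` is 17:30 UTC, two hours before `now`. A lower-traffic dataset would use a larger estimate, up to 24 hours.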

Most triggers query recent data, and our systems continued to evaluate and alert on these triggers as usual. Any trigger that queried cold data failed during this incident. SLOs continued to function for the duration of this incident, and burn alerts were sent as usual.

Posted Jun 15, 2023 - 08:09 PDT

Resolved

This incident has been resolved, and our temporary mitigations have been removed.

Posted Jun 13, 2023 - 14:28 PDT

Update

Our mitigations, along with some amount of AWS service recovery, seem to have led to normal levels of service, but we're continuing to monitor to ensure continued reliability.

Posted Jun 13, 2023 - 13:58 PDT

Update

We're seeing some positive signals from our infrastructure, including a reduction in failing queries, but are continuing to monitor.

Posted Jun 13, 2023 - 13:48 PDT

Monitoring

We are continuing to monitor the status of AWS.

Posted Jun 13, 2023 - 13:40 PDT

Update

Queries and triggers are more likely to succeed if they use smaller time windows, so that Lambda isn't invoked in the query path. For higher-traffic datasets, this could mean a window of the last 2 hours; for lower-traffic datasets, up to the last 24 hours can be queried more reliably.

Posted Jun 13, 2023 - 13:11 PDT

Identified

While AWS experiences issues and our clusters are increasingly affected, we're monitoring our infrastructure to ensure data integrity (that is, that the traffic we receive isn't lost).

Posted Jun 13, 2023 - 12:57 PDT

Update

Lambda seems to be largely affected, and customers may see some (but not all) of their queries fail.

Posted Jun 13, 2023 - 12:38 PDT

Update

AWS is currently experiencing issues with many services. We are identifying which parts of Honeycomb's product surface are affected and attempting to mitigate. https://health.aws.amazon.com/health/status

Posted Jun 13, 2023 - 12:34 PDT

Investigating

We are attempting to mitigate an issue with one of the layers in our secrets stack. This is currently impacting some customers' ability to query.

Posted Jun 13, 2023 - 12:19 PDT

This incident affected: ui.honeycomb.io - US1 Querying and ui.honeycomb.io - US1 Trigger & SLO Alerting.