Trigger & Querying Degraded

Incident Report for Honeycomb

Postmortem

Preliminary report

On September 9, we experienced roughly three and a half hours of degraded query performance, missed trigger runs, and late SLO processing. This outage was a repeat of the incident we had on December 7, 2021, in which our secret store provider, used by all our components to load their configuration, suffered an outage during which most invocations failed.

We detected the outage after a deployment, which forced all instances of the querying services to do a rolling restart. Unfortunately, that rolling restart left some of the hosts in a non-functional state because they could not fetch configuration values such as database connection strings or other key parameters.

We reused emergency procedures from a previous incident and started working on mitigation techniques by manually recovering production configuration data and moving it to different storage locations so the services could come up.

In doing so, we were able to progressively bring querying back up, albeit with stale data at first, then restore querying in full, and then restore triggers and SLO runs. SLO runs that had been delayed replayed their old data and sent every alert they needed to send, but failed trigger runs were not retried, because the current trigger implementation does not retry them.

During the incident, we also identified a new workaround for the future: we will maintain passive synchronization of multi-region secret storage, with the ability to quickly switch to a fail-over zone if this happens again.
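
As an illustration only, the fail-over idea can be sketched as a loader that tries each secret store location in order and falls back to the next one when a call fails. This is not our actual implementation: the names `load_secret`, `SecretUnavailable`, the `primary` and `replica` labels, and the fetcher functions are all hypothetical, and the fetchers stand in for real calls to a secret store.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: a function that returns a secret value for a given name.
Fetcher = Callable[[str], str]


class SecretUnavailable(Exception):
    """Raised when no configured location can return the requested secret."""


def load_secret(name: str, fetchers: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each (label, fetcher) pair in order; return (label, value) from the first success."""
    errors: List[str] = []
    for label, fetch in fetchers:
        try:
            return label, fetch(name)
        except Exception as exc:  # any failure moves us on to the next location
            errors.append(f"{label}: {exc}")
    raise SecretUnavailable(f"could not load {name!r}: " + "; ".join(errors))


# Stand-ins for real secret store clients (assumptions for this sketch).
def primary_fetch(name: str) -> str:
    raise TimeoutError("primary location unavailable")


REPLICA_DATA: Dict[str, str] = {"db/connection_string": "postgres://example"}


def replica_fetch(name: str) -> str:
    return REPLICA_DATA[name]


if __name__ == "__main__":
    source, value = load_secret(
        "db/connection_string",
        [("primary", primary_fetch), ("replica", replica_fetch)],
    )
    print(f"loaded from {source}")
```

The ordering of the list is the fail-over policy in this sketch: the replica is only consulted when the primary call raises, which matches a passive copy that is kept in sync but not normally read.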

At this point in time, we do not plan to release a more in-depth incident review, as we are focusing our attention on investigating the September 8 outage related to ingest delays.

Posted Sep 21, 2022 - 08:08 PDT

Resolved

Everything in the platform is back to its expected state. We are considering this incident complete.
We'll keep working on mitigation mechanisms in case of any new instability we can detect.

Posted Sep 09, 2022 - 11:38 PDT

Monitoring

We are back to having all services operational, and are continuing to monitor in case of regression. We will continue to work on mitigations in case of similar issues in the future.

Posted Sep 09, 2022 - 11:16 PDT

Update

We are continuing to work towards restoring triggers and SLOs. Our current efforts are focused on migrating some data to a different region to avoid current issues with AWS.

Posted Sep 09, 2022 - 10:56 PDT

Update

While querying continues to be stable, we are continuing to work on a fix for triggers & SLOs.

Posted Sep 09, 2022 - 10:23 PDT

Update

We believe that querying is stable, and we are continuing to work on a fix for triggers & SLOs.

Posted Sep 09, 2022 - 09:52 PDT

Update

We believe that queries should be stabilizing - we are working on fixes to bring triggers back online.

Posted Sep 09, 2022 - 09:19 PDT

Update

We're slowly making progress to bring back querying. You may be able to successfully query via ui.honeycomb.io, but the data will likely be stale.

Posted Sep 09, 2022 - 08:48 PDT

Update

We have narrowed the problem down to AWS SSM availability and are working around it to bring back querying.

Posted Sep 09, 2022 - 08:36 PDT

Update

Querying is severely degraded; we are continuing to work on implementing the fix.

Posted Sep 09, 2022 - 08:18 PDT

Identified

Querying, Triggers, and SLOs are currently degraded. We have identified the cause of the issue and are implementing a fix.

Posted Sep 09, 2022 - 08:10 PDT

This incident affected: ui.honeycomb.io - US1 Querying and ui.honeycomb.io - US1 Trigger & SLO Alerting.