On Friday, November 18, we experienced roughly 2 hours and 10 minutes of degraded SLO evaluation performance, including two disjoint periods of roughly 20–25 minutes each during which SLO alerts were delayed. During this time, your burn alerts may have fired late or taken longer than usual to resolve after their budget had recovered. No alerts were lost, and contrary to what we believed during the incident, triggers remained fully functional for the entire period.
At 2:00 UTC (6pm PT), one of our threshold alerts told us that our SLO database was using more CPU than expected. We had been rolling out scalability improvements to the SLO services that week, including migrating backing data to a new table structure, so our on-call engineer paused the migration to stabilize the system. Unfortunately, by that time we were already drawing on our RDS instance's burst capacity to serve steady-state traffic; that capacity eventually ran out and left us with degrading performance.
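For readers unfamiliar with burstable instances: they earn CPU credits at a baseline rate and spend them whenever utilization exceeds that baseline, so a steady load above baseline silently drains the pool until the instance is throttled. A minimal sketch of that dynamic (all numbers are hypothetical, not our actual instance figures):

```python
def simulate_credit_balance(start_credits, baseline_pct, usage_pct, minutes):
    """Per-minute CPU-credit balance of a hypothetical burstable instance.

    Credits accrue at the baseline rate and are spent at the actual
    utilization rate (1 credit is roughly 1 vCPU-minute at 100%).
    """
    balance = start_credits
    history = []
    for _ in range(minutes):
        # Net change per minute: earn at baseline, spend at actual usage.
        balance += (baseline_pct - usage_pct) / 100.0
        balance = max(balance, 0.0)  # the pool cannot go negative
        history.append(balance)
    return history

# A 70% steady load against a 40% baseline empties a 60-credit pool in
# about 200 minutes, after which performance falls back to baseline.
history = simulate_credit_balance(start_credits=60.0, baseline_pct=40.0,
                                  usage_pct=70.0, minutes=240)
print(f"credits after 4h: {history[-1]:.1f}")  # → credits after 4h: 0.0
```

The point of the sketch is that nothing looks wrong while credits remain; the cliff only becomes visible once the balance nears zero, which is why the CPU alert fired well into the depletion.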
Because the trigger and SLO mechanisms share a critical code path on the read side, we reduced and eventually turned off SLO reads to preserve trigger capacity. During that time, writes and SLO evaluation kept happening, but no aggregation or alerting took place.
Our on-call team found expensive database queries and identified a subtlety in the indexing structure of the new SLO tables that meant performance would steadily degrade as the migration progressed. By this point, database capacity was low enough that we believed modifying the index live was unlikely to succeed.
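The general shape of this class of problem: when an index does not lead with the column a query filters on by equality, the engine has to walk a much wider range of rows and discard most of them, so query cost grows with total table size rather than with one SLO's data. A hedged sketch using SQLite and a hypothetical schema (our real tables and engine differ; this only illustrates the indexing principle):

```python
import sqlite3

# Hypothetical schema for illustration; the real table layout is internal.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE slo_events (slo_id INTEGER, window_start INTEGER, value REAL)")

# A time-only index makes the engine scan every row in the time window,
# across all SLOs, filtering slo_id row by row: work grows with the table.
con.execute("CREATE INDEX idx_time ON slo_events (window_start)")
plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT sum(value) FROM slo_events "
    "WHERE slo_id = ? AND window_start > ?", (1, 0)).fetchall()

# A composite index leading with the equality column lets the engine seek
# directly to one SLO's window: work stays proportional to that SLO's data.
con.execute("CREATE INDEX idx_slo_time ON slo_events (slo_id, window_start)")
plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT sum(value) FROM slo_events "
    "WHERE slo_id = ? AND window_start > ?", (1, 0)).fetchall()
print(plan_after)
```

This also illustrates why the problem surfaced gradually: the query stays correct either way, and only gets slower as the migrated data accumulates.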
To restore service, we truncated the stale data in the new tables to compensate for their less effective index, kept the older infrastructure as-is, and backed out of the migration. We then gradually turned SLO reads back on, made sure performance was acceptable, and eventually closed the incident.
We plan to fix the indexing structure during working hours this week, which should let us improve the scalability of our SLO services.