Triggers and SLOs impaired
Incident Report for Honeycomb

On Friday, November 18, we experienced roughly 2 hours and 10 minutes of degraded SLO evaluation performance, including two disjoint periods of roughly 20-25 minutes each during which SLO alerts were delayed. During this time, your burn alerts may have been delayed or taken longer than usual to resolve after their budget had cleared up. No alerts were lost, and contrary to what we believed during the incident, triggers remained fully functional for the entire time period.

At 2:00 UTC (6pm PT), one of our threshold alerts told us that our SLO database was using more CPU resources than expected. Since we had been rolling out scalability improvements to the SLO services that week, which included migrating backing data to a new table structure, our on-call engineer paused the migration to stabilize the system. Unfortunately, by that time we were already relying on our RDS instance's burst capacity to serve steady-state traffic; that capacity eventually ran out, leaving us with degrading performance.

Because the trigger and SLO mechanisms share a critical code path on the read side, we reduced and eventually turned off SLO reading to salvage trigger capacity. During that time, writes and SLO evaluation kept happening, but no aggregation or alerting took place.

Our on-call team found expensive database queries and identified a subtlety in the indexing structure of the new SLO tables that meant performance would slowly degrade as the migration progressed. At this point, database capacity was low enough that we believed modifying the index live was unlikely to succeed.
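The report does not describe the actual schema or engine (only that an RDS instance was involved), but this class of indexing subtlety can be sketched generically. In the hypothetical example below (table, column, and index names are all invented for illustration, and SQLite stands in for the real database), an index that lacks the per-SLO leading column forces each read to walk the whole time window and filter rows afterward, so query cost grows with the total migrated data volume; a composite index keeps the cost proportional to a single SLO's data:

```python
# Hypothetical illustration of a per-SLO range query degrading under a
# timestamp-only index. All names here are invented; the report does not
# disclose Honeycomb's schema, and SQLite merely stands in for their RDS engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slo_events (slo_id INTEGER, ts INTEGER, ok INTEGER)")

# Index on the timestamp alone: a query for one SLO's recent events must
# traverse every row in the time window (all SLOs) and filter by slo_id,
# so its cost grows with total table size as a migration backfills data.
conn.execute("CREATE INDEX idx_ts ON slo_events (ts)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(ok) FROM slo_events WHERE slo_id = ? AND ts > ?",
    (1, 0),
).fetchall()
print(plan[0][3])  # range search on ts only; slo_id filtered row by row

# Composite index: the query can seek directly to one SLO's rows, so cost
# tracks that SLO's data volume rather than the whole table's.
conn.execute("CREATE INDEX idx_slo_ts ON slo_events (slo_id, ts)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(ok) FROM slo_events WHERE slo_id = ? AND ts > ?",
    (1, 0),
).fetchall()
print(plan[0][3])  # planner now picks idx_slo_ts for an equality+range seek
```

Rebuilding such an index on a live, already-overloaded database is a heavyweight operation, which is consistent with the team's judgment that an online fix was unlikely to succeed.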

To restore service, we truncated the stale data contained in the new tables to compensate for their less effective index, kept the older infrastructure as-is, and backed out of the migration. We then gradually turned SLO reads back on, verifying that performance was acceptable, and eventually closed the incident.

We plan to fix the indexing structure during working hours this week, which should let us resume improving the scalability of our SLO services.

Posted Nov 21, 2022 - 07:44 PST

In retrospect it appears that Triggers were never delayed by more than 2 minutes.
SLOs were up to 45 minutes delayed but all notifications made it out eventually.
All systems are once again operational.
Posted Nov 18, 2022 - 21:33 PST
Both triggers and SLOs are operating within normal parameters. We are watching to ensure continued recovery.
Posted Nov 18, 2022 - 21:15 PST
Triggers are now functioning normally. SLOs remain slightly delayed.
Posted Nov 18, 2022 - 20:42 PST
We continue to mitigate the load on both triggers and SLOs. Triggers are intermittent but succeeding; SLOs are still delayed.
Posted Nov 18, 2022 - 20:17 PST
Trigger evaluations are delayed by 30-45 minutes and SLO evaluations are temporarily suspended due to higher than normal load. We are investigating the source of the load.
Posted Nov 18, 2022 - 19:18 PST
This incident affected: US1 Trigger & SLO Alerting.