On Friday, November 18, we experienced roughly 2 hours and 10 minutes of degraded SLO evaluation performance, including two disjoint periods of roughly 20–25 minutes each during which SLO alerts were delayed. During this time, your burn alerts may have fired late or taken longer than usual to resolve after their budget had recovered. No alerts were lost, and contrary to what we believed during the incident, triggers remained fully functional for the entire period.
At 2:00 UTC (6pm PT), one of our threshold alerts told us that our SLO database was using more CPU than expected. We had been rolling out scalability improvements to the SLO services that week, including migrating backing data to a new table structure, so our on-call engineer paused the migration to stabilize the system. Unfortunately, by that time we were already drawing on our RDS instance's burst capacity to serve steady-state traffic; that capacity eventually ran out and left us with degrading performance.
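For readers unfamiliar with burstable instances: they earn CPU credits at a baseline rate and spend them whenever utilization exceeds that baseline, so a steady load above baseline silently drains the pool until the instance is throttled. A minimal sketch of that dynamic (all numbers are hypothetical, not our actual instance figures):

```python
def simulate_credit_balance(start_credits, baseline_pct, usage_pct, minutes):
    """Per-minute CPU-credit balance of a hypothetical burstable instance.

    Credits accrue at the baseline rate and are spent at the actual
    utilization rate (1 credit is roughly 1 vCPU-minute at 100%).
    """
    balance = start_credits
    history = []
    for _ in range(minutes):
        # Net change per minute: earn at baseline, spend at actual usage.
        balance += (baseline_pct - usage_pct) / 100.0
        balance = max(balance, 0.0)  # the pool cannot go negative
        history.append(balance)
    return history

# A 70% steady load against a 40% baseline empties a 60-credit pool in
# about 200 minutes, after which performance falls back to baseline.
history = simulate_credit_balance(start_credits=60.0, baseline_pct=40.0,
                                  usage_pct=70.0, minutes=240)
print(f"credits after 4h: {history[-1]:.1f}")  # → credits after 4h: 0.0
```

The point of the sketch is that nothing looks wrong while credits remain; the cliff only becomes visible once the balance nears zero, which is why the CPU alert fired well into the depletion.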
Because the trigger and SLO mechanisms share a critical code path on the read side, we reduced and eventually turned off SLO reads to preserve trigger capacity. During that time, writes and SLO evaluation kept happening, but no aggregation or alerting took place.
Our on-call team found expensive database queries and identified a subtlety in the indexing structure of the new SLO tables that meant performance would steadily degrade as the migration progressed. By this point, database capacity was low enough that we believed modifying the index live was unlikely to succeed.
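The general shape of this class of problem: when an index does not lead with the column a query filters on by equality, the engine has to walk a much wider range of rows and discard most of them, so query cost grows with total table size rather than with one SLO's data. A hedged sketch using SQLite and a hypothetical schema (our real tables and engine differ; this only illustrates the indexing principle):

```python
import sqlite3

# Hypothetical schema for illustration; the real table layout is internal.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE slo_events (slo_id INTEGER, window_start INTEGER, value REAL)")

# A time-only index makes the engine scan every row in the time window,
# across all SLOs, filtering slo_id row by row: work grows with the table.
con.execute("CREATE INDEX idx_time ON slo_events (window_start)")
plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT sum(value) FROM slo_events "
    "WHERE slo_id = ? AND window_start > ?", (1, 0)).fetchall()

# A composite index leading with the equality column lets the engine seek
# directly to one SLO's window: work stays proportional to that SLO's data.
con.execute("CREATE INDEX idx_slo_time ON slo_events (slo_id, window_start)")
plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT sum(value) FROM slo_events "
    "WHERE slo_id = ? AND window_start > ?", (1, 0)).fetchall()
print(plan_after)
```

This also illustrates why the problem surfaced gradually: the query stays correct either way, and only gets slower as the migrated data accumulates.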
To restore service, we truncated the stale data in the new tables to compensate for their less effective index, kept the older infrastructure as-is, and backed out of the migration. We then gradually turned SLO reads back on, made sure performance was acceptable, and eventually closed the incident.
We plan to fix the indexing structure during working hours this week, which should let us improve the scalability of our SLO services.