On April 16, we experienced 55 minutes of degraded performance for interactive queries and board rendering, affecting a dozen or so teams. During this time, queries that usually completed in under 5 seconds could take about a minute. More importantly, for about 25 minutes the evaluation of triggers and SLOs in our US region was interrupted, meaning alerts may have been delayed or missed.
We mostly detected the slow queries through customers reaching out to us. On our end, the main performance SLOs never fell below their thresholds, and we remained within our error budget overall. We attributed the rising delays to increased usage of shared lambda resources, caused by background tasks queuing up, which in turn created contention for some queries. As we started an internal incident to handle this, we were paged because our alerting subsystem was no longer reporting as healthy.
We identified contention on the underlying resources as the main contributor and tweaked some rate-limiting parameters to bring overall usage back to manageable levels. As we did so, the alerting system also recovered. We monitored the system for a while to make sure it was functioning normally before closing the incident.
Our investigation mostly focused on what exactly caused alerting to hang, a behavior that surprised every responder. A key observation was that the system worked fine under pressure until an automated deployment happened. We eventually found that while resource contention in our lambdas did slow queries down, it was restarting from that deployment while under pressure that caused the stall.
As it turns out, the alerting application gradually backfills recently changed SLOs in the background. However, the first iteration of this task runs in the foreground at boot time, and only subsequent runs move to the background. Because the application restarted while the system was under heavy contention, it stalled on that first run and did not recover while load remained high. Once we resolved the contention, that initial backfill finished, subsequent runs moved to the background as intended, and alerting came back.
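To make the failure mode concrete, here is a minimal sketch of that startup path, written in Go with hypothetical function names (backfillChangedSLOs, evaluateTriggersAndSLOs); it is illustrative only and not our actual code. Blocking on the first backfill at boot ties the whole evaluation loop to how long that catch-up takes, while running it asynchronously lets evaluation start immediately after a restart.

```go
package main

import (
	"context"
	"log"
	"time"
)

// backfillChangedSLOs stands in for the catch-up work on recently changed
// SLOs. Under heavy resource contention this can take far longer than usual.
func backfillChangedSLOs(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(100 * time.Millisecond): // placeholder for the real work
		return nil
	}
}

// evaluateTriggersAndSLOs stands in for the main alerting loop.
func evaluateTriggersAndSLOs(ctx context.Context) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// ... evaluate triggers and SLOs, send alerts ...
		}
	}
}

func main() {
	ctx := context.Background()

	// Problematic pattern: the first backfill runs in the foreground at boot,
	// so a restart under contention blocks alerting entirely:
	//
	//   _ = backfillChangedSLOs(ctx)
	//   evaluateTriggersAndSLOs(ctx)

	// Fixed pattern: the first backfill runs asynchronously like every
	// subsequent one, so alerting starts right after a restart.
	go func() {
		if err := backfillChangedSLOs(ctx); err != nil {
			log.Printf("initial SLO backfill failed: %v", err)
		}
	}()
	evaluateTriggersAndSLOs(ctx)
}
```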
Our two follow-up actions have been to tweak the alerting for our trigger and SLO components so they page us roughly 3-5x faster next time, and to make sure the first run of background tasks is done asynchronously, as we initially expected it to be.
We do not plan on doing further in-depth reviews of this incident at this time.