On September 9, we experienced roughly three and a half hours of degraded query performance, missed trigger runs, and late SLO processing. This outage was a repeat of the incident we had on December 7, 2021, in which our secret store provider, used by all our components to load their configuration, suffered an outage during which most invocations failed.
We detected the outage after a deployment that forced all instances of the querying services through a rolling restart. Unfortunately, that rolling restart left some hosts in a non-functional state, unable to fetch configuration values such as database connection strings and other key parameters.
We reused emergency procedures from a previous incident and began mitigation by manually recovering production configuration data and moving it to alternate storage locations so the services could come back up.
In doing so, we were able to progressively restore querying ability, initially with stale data, then in full, after which we worked to restore trigger and SLO runs. SLO runs that had been delayed replayed their old data and alerted everyone they had to alert, but failed trigger runs are not retried under their current implementation.
During the incident, we also identified a workaround for the future: we will maintain passive synchronization of our secrets to multi-region storage, with the ability to quickly fail over to a secondary region if this were to happen again.
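The failover pattern described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the store client, region names, and secret names are all hypothetical.

```python
# Sketch of region-failover secret loading. The client class, regions,
# and secret names are hypothetical stand-ins, not our real setup.

class SecretUnavailable(Exception):
    """Raised when no configured region can serve the secret."""


def fetch_secret(name, stores):
    """Try each regional store in order, returning the first success.

    `stores` is an ordered list of (region, client) pairs: the primary
    region first, passively synchronized replicas after it.
    """
    errors = []
    for region, client in stores:
        try:
            return client.get(name)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append((region, exc))
    raise SecretUnavailable(f"{name}: all regions failed: {errors}")


class MemoryStore:
    """Tiny in-memory stand-in for a regional secret store client."""

    def __init__(self, data):
        self._data = data

    def get(self, name):
        if name not in self._data:
            raise KeyError(name)
        return self._data[name]
```

With this shape, a primary-region outage (the first store failing) transparently falls through to the replica, which is what the passive synchronization is meant to guarantee.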
At this time, we do not plan to publish a more in-depth incident review, as we are focusing our attention on investigating the September 8 outage related to ingest delays.