On June 3rd, we experienced a 20-minute query outage in the US region, along with a small increase in ingest failures. During this time customers were unable to query their data and alerting was delayed, but less than 0.1% of the data sent to us was dropped.
We received an early alert that our ingest system appeared to be unreachable. This correlated strongly with a database schema migration we had just started, and engineers quickly confirmed the migration as the cause. The migration slowed down our largest database in the US environment and caused operations to pile up.
We opened a public incident stating that all systems were down. In fact, because our ingest pathway has a robust caching mechanism, we continued to accept the majority of incoming data without issue. Querying and triggers were still failing, however, and SLO alerting was delayed by several minutes.
Our first priority was to cancel the migration, but reaching the database proved difficult because all connections were saturated. We also debated failing the database over to another availability zone by rebooting it, but by the time we managed to get a live connection, our automated systems had already detected the failure and performed the failover for us, after which the system became stable again.
The migration modified an ENUM set on a database table, which unexpectedly caused a full table rewrite. It had previously run without issue on smaller databases, giving us a false sense of security, and two prior changes to the same ENUM field had not caused any performance problems either. After the restart we verified that data integrity was maintained, that all caches were consistent, and that the overall migration could safely complete. We are currently looking at strengthening our ability to spot risky migrations ahead of time, regardless of how well they worked on other databases in other environments.
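As an illustration of the kind of pre-flight check we are considering, the sketch below flags migration statements that change an ENUM column, since on some database engines that change can trigger a full table rewrite. This is a hypothetical example, not our actual tooling; the regular expression, the `ingest_events` table, and the `status` column are assumptions made for the illustration.

```python
import re

# ALTER TABLE ... MODIFY/ALTER COLUMN ... ENUM(...) statements.
# On some engines, changing an ENUM definition rewrites the whole table,
# which is the behavior that caught us out here.
ENUM_CHANGE = re.compile(
    r"ALTER\s+TABLE\s+(?P<table>\S+)"
    r".*?(?:MODIFY|ALTER)\s+(?:COLUMN\s+)?(?P<column>\S+)\s+ENUM\s*\(",
    re.IGNORECASE | re.DOTALL,
)


def risky_enum_changes(migration_sql: str) -> list[str]:
    """Return a warning for every statement that changes an ENUM column."""
    return [
        f"Changing ENUM on {m.group('table')}.{m.group('column')} may cause "
        "a full table rewrite; verify against a production-sized copy first."
        for m in ENUM_CHANGE.finditer(migration_sql)
    ]


# Example migration; table and column names are made up for illustration.
migration = """
ALTER TABLE ingest_events
    MODIFY COLUMN status ENUM('queued', 'processed', 'failed', 'retried');
"""

for warning in risky_enum_changes(migration):
    print(warning)
```

A check like this would run in CI against every proposed migration, so a risky statement is surfaced before it reaches a large production database rather than after.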