We deployed a code change to our storage management system that overwhelmed our production MySQL database. As a result, our data ingestion service was unable to receive events for 22 minutes, from 19:08 UTC to 19:30 UTC. The service began recovering at 19:30 UTC but continued to experience errors at a decreasing rate until 19:57 UTC.
Stabilization steps began at 19:18 UTC with a deployment rollback. A fix was merged at 20:03 UTC and deployed at 20:19 UTC. We continued to monitor until 21:00 UTC, when the incident was declared fully resolved.