UI and API unavailable
Incident Report for Honeycomb
Postmortem

On August 6, we experienced an outage impacting multiple components of our platform between 12:59:39 PDT and 13:20:05 PDT. Within that window, for roughly 17 minutes, about 25% of incoming telemetry data was rejected and our API rejected 75% of requests (mostly to the /1/auth endpoint); the ui.honeycomb.io website was completely unusable for at least 19 minutes; triggers were not evaluated during that time; and SLO evaluations may have been delayed or notifications may not have been sent.

Our engineers noticed a degradation at roughly 13:00 PDT; alerts confirming a major issue went out at 13:04 PDT, and we spun up our internal incident response in parallel.

As most components started degrading at the same time, right around a deployment, it took a few minutes to get oriented and narrow the issue down to database performance, correlated with a table schema migration. We managed to identify a stuck query, but by the time we knew exactly which one was involved, the database was so overloaded that we could not log in with the elevated privileges required to terminate it, and we had to fail the database over. This resolved the issue, and we spent a few more minutes making sure all data was correct and that all subsystems had recovered properly.

The schema migration was technically safe: a column addition to the teams table using the INSTANT algorithm, which should cause no downtime or interruption. Unbeknownst to us, a few seconds before the migration was applied, a costly SELECT query started running. This query had been mostly unchanged for the last 5 years, runs roughly 10 times a day, and had never caused issues.
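
For illustration, an instant column addition takes roughly the following form; the column name here is a placeholder, not the actual change we shipped:

    -- Hypothetical column addition. With ALGORITHM=INSTANT, MySQL only
    -- updates table metadata and does not rebuild or copy the table.
    ALTER TABLE teams
      ADD COLUMN example_flag TINYINT NOT NULL DEFAULT 0,
      ALGORITHM = INSTANT;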

The migration, modifying the same table, was scheduled at the same time. Its ALTER statement requested an exclusive metadata lock, and that pending lock request prevented any other query from running against the table while the ALTER waited for already-running queries and transactions using the table to finish. This is usually a short wait, and as soon as the ALTER statement is scheduled, other operations can in turn be scheduled concurrently.
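
A minimal sketch of the interleaving, with placeholder statements standing in for the real queries:

    -- Session 1: the slow read. It holds a shared metadata lock on `teams`
    -- for as long as the statement runs (SLEEP stands in for the real work).
    SELECT SLEEP(300) FROM teams LIMIT 1;

    -- Session 2: the migration. It requests an exclusive metadata lock and
    -- must wait for session 1, even though the ALTER itself is instant.
    ALTER TABLE teams
      ADD COLUMN example_flag TINYINT NOT NULL DEFAULT 0,
      ALGORITHM = INSTANT;

    -- Session 3: any later query touching `teams` (including fast reads such
    -- as auth lookups) queues behind the pending exclusive lock and shows
    -- "Waiting for table metadata lock" in the processlist.
    SELECT * FROM teams WHERE id = 1;  -- `id` is a placeholder column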

Our investigation revealed that this specific slow SELECT query could easily take more than 5 minutes to complete for some customer organizations. Generally this isn’t a problem, as these queries can run concurrently and do not block other operations; the client connection from our software times out and returns quickly while the query finishes later in MySQL.
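
One way to see this, assuming MySQL 8.0: a query the application has already given up on remains visible, and still holds its metadata lock, in the server’s processlist (the 60-second threshold below is illustrative):

    -- Long-running statements against the teams table; TIME is in seconds.
    -- On older servers, SHOW FULL PROCESSLIST exposes the same information.
    SELECT ID, TIME, STATE, INFO
    FROM performance_schema.processlist
    WHERE INFO LIKE '%FROM teams%'
      AND TIME > 60;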

The end result is an unfortunate scheduling edge case within MySQL, where a generally non-blocking query stalled a schema change that is also generally non-blocking. Because the query ran for so long, everything having to do with teams, such as authentication, hung behind it: the slow query held back the ALTER, which held back all other queries on the table until it could be scheduled, and many systems in turn became unresponsive. The same migration was re-applied without problems a few minutes later.
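
For reference, this kind of pile-up can be inspected while it happens. A sketch of how the blocking session can be found and terminated, assuming the performance_schema metadata-lock instrument is enabled (the connection id is a placeholder):

    -- Who is waiting on a metadata lock for `teams`, and who is blocking them.
    SELECT object_name,
           waiting_query,
           blocking_pid,
           sql_kill_blocking_connection
    FROM sys.schema_table_lock_waits
    WHERE object_name = 'teams';

    -- Terminating the blocking connection (the slow SELECT) lets the ALTER
    -- acquire its lock, complete instantly, and the queued queries drain.
    KILL 12345;  -- placeholder connection id taken from the query above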

We are currently auditing the specific query that ran long enough to contribute to the outage, to see whether it can be optimized or made to time out much sooner on the database side. Following this, we hope to better enforce database-side timeouts in general and align them with our client-side timeouts. This should ensure that schema migrations that should be safe actually are.
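
As a sketch of the kind of database-side limits involved (the values are illustrative, not the settings we will ship):

    -- Cap read queries server-side (milliseconds; applies to SELECT):
    SET GLOBAL max_execution_time = 30000;

    -- Or per statement, via an optimizer hint:
    SELECT /*+ MAX_EXECUTION_TIME(30000) */ id FROM teams WHERE id = 1;

    -- Separately, cap how long DDL waits on a metadata lock (seconds), so a
    -- blocked migration fails fast instead of queueing everything behind it:
    SET SESSION lock_wait_timeout = 30;
    ALTER TABLE teams
      ADD COLUMN example_flag TINYINT NOT NULL DEFAULT 0,
      ALGORITHM = INSTANT;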

We do not plan a more in-depth public review at this time, although we will continue investigating these events internally.

Posted Aug 07, 2024 - 15:42 PDT

Resolved
We have confirmed all Honeycomb services are once again operational.
Posted Aug 06, 2024 - 14:00 PDT
Monitoring
Both the UI and API are once again functional and we are following up on the related changes.
Posted Aug 06, 2024 - 13:29 PDT
Investigating
The Honeycomb UI is unavailable to many customers and some traffic is being rejected at the API. We have identified an overloaded database table and are working to mitigate the issue.
Posted Aug 06, 2024 - 13:12 PDT
This incident affected: api.honeycomb.io - US1 Event Ingest, ui.honeycomb.io - US1 Querying, ui.honeycomb.io - US1 App Interface, and ui.honeycomb.io - US1 Trigger & SLO Alerting.