DB migration slowing the interface down a bit

Incident Report for Honeycomb

Resolved

Service maps are now caught up, so we are moving this incident to resolved.

Posted Sep 01, 2023 - 17:13 PDT

Update

Query and application performance have returned to normal, but as a result of the incident we are behind on ingesting the events that build our service maps. We will provide a final update once service maps have caught up.

Posted Sep 01, 2023 - 14:32 PDT

Monitoring

We have failed over the affected database, which relieved the database load issue we encountered. Unfortunately, this caused a short period of query failures while we restarted a service that did not gracefully reconnect to the DB. Querying and performance should now be normal.

Posted Sep 01, 2023 - 14:29 PDT

Update

We've fixed everything on the query side but are still seeing issues. We are currently digging deeper into our database to find out more about the unexpected performance degradation. We have disabled service map updates in the meantime to preserve full capacity for other features.

Posted Sep 01, 2023 - 12:47 PDT

Identified

We've identified unrelated queries and specific ingest patterns that might be contributing to the performance issues. We're still stable, albeit a bit slower at times, and we're looking at whether addressing them brings more performance back.

Posted Sep 01, 2023 - 10:39 PDT

Update

Performance seems to be recovering and SLO/trigger performance is back within normal limits, but there is still a slightly elevated number of query errors, including on the homepage.

Posted Sep 01, 2023 - 09:44 PDT

Update

The database migration has finished, but internal processes on the host have brought it to full utilization. We're investigating strategies to mitigate the effects until it catches up.

Posted Sep 01, 2023 - 09:17 PDT

Monitoring

A database migration has had a slightly larger impact than we expected. The interface might be slower and some triggers could take a few extra seconds to run in aggregate, but we expect everything to remain fully functional and plan to ride it out until the end of the migration.

Posted Sep 01, 2023 - 08:17 PDT

This incident affected: ui.honeycomb.io - US1 Querying, ui.honeycomb.io - US1 App Interface, and ui.honeycomb.io - US1 Trigger & SLO Alerting.