Honeycomb Ingest Outage

Incident Report for Honeycomb

Postmortem

We deployed a code change to our storage management system which overwhelmed our production MySQL database. This impacted our data ingestion service, which was unable to receive events for 22 minutes from 19:08 UTC to 19:30 UTC. The data ingestion service began recovering at 19:30 UTC, but continued to experience errors at a decreasing rate until 19:57 UTC.

Stabilization steps were initiated at 19:18 UTC with a deployment rollback, and a fix was merged at 20:03 and deployed at 20:19 UTC. We continued to monitor until 21:00 UTC when the incident was declared as fully resolved.

Posted May 13, 2020 - 15:35 PDT

Resolved

This incident has been resolved.

Posted May 13, 2020 - 14:00 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 13, 2020 - 13:33 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted May 13, 2020 - 12:58 PDT

Investigating

We are currently investigating this issue.

Posted May 13, 2020 - 12:41 PDT

This incident affected: api.honeycomb.io - US1 Event Ingest, ui.honeycomb.io - US1 Querying, and ui.honeycomb.io - US1 App Interface.