On July 28 starting at approximately 01:30am PDT, Honeycomb began experiencing increased error rates and latency across our services, leading to a partial outage ultimately lasting approximately two hours and impacting 0.002% of ingest traffic over that period. The outage was triggered by an AWS Route53 issue in which DNS lookups from EC2 instances in the us-east-1a availability zone (AZ) failed; that AZ hosts many of our instances across all our services.
Our Ingest Latency SLO burn alert triggered at 01:53am PDT. On-call investigated and used Honeycomb (BubbleUp specifically) to determine that our ingest instances in us-east-1a were to blame. On-call responded by removing that AZ from the autoscaling group, which reacted by replacing the problematic hosts with new ones in unaffected AZs. This migration finished at 03:15am PDT. There was a small amount of ingested event loss due to the outage (roughly 35,000 event payloads, 0.002% of our traffic over that period) because retries for sending the events to Kafka were exhausted on the affected ingest hosts.
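The mechanism behind that event loss can be illustrated with a simplified sketch. This is hypothetical code, not Honeycomb's actual ingest pipeline: the function names and retry count are made up, and `broken_sink` stands in for a Kafka producer whose broker connections fail while DNS is broken.

```python
# Hypothetical sketch: bounded retries against a downstream sink, and what
# happens when they are exhausted -- the payload is dropped.

def send_with_retries(payload, send_fn, max_retries=3):
    """Try to deliver payload; return True on success, False once retries run out."""
    for _ in range(1 + max_retries):
        try:
            send_fn(payload)
            return True
        except ConnectionError:
            continue  # transient failure, e.g. broker unreachable during the DNS outage
    return False  # retries exhausted: this payload is lost


def broken_sink(payload):
    raise ConnectionError("broker unreachable")  # stands in for Kafka during the outage


def healthy_sink(payload):
    pass  # delivery succeeds


print(send_with_retries("event-1", healthy_sink))  # True: delivered
print(send_with_retries("event-2", broken_sink))   # False: retries exhausted, event lost
```

Bounded retries are a deliberate trade-off: unbounded retries on every ingest host would instead build up backlogs and memory pressure during a prolonged outage.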
While ingest is the most important service to fix quickly, this outage also affected other services: we saw elevated response times for our UI and storage service, and some triggers and burn alerts failed to evaluate during that time window. If you believe you were impacted by a trigger or alert failing to run during this period, please reach out to our support team at honeycomb.io/support.
Throughout this incident, Honeycomb remained usable despite the degradation, and we used it to debug the incident.
AWS declared the outage resolved at 04:37am PDT, but we delayed transitioning ingest hosts back to us-east-1a until 09:45am PDT.