API service partition

Incident Report for Honeycomb

Postmortem

We ran into a bug in Kafka controller identification/election (brains were split) on Oct 17 which caused a partial outage of our ingest service. TLDR below or read more in our blog post!

What happened:

Kafka (part of our ingest pipeline) suffered a split brain problem because of a bug in versions previous to 0.10.2.1

Impact:

6:03-10:45am PDT: Some (33%, actually) teams actively sending data had a partial loss of writes--of those, most lost <50% of writes
10:50-11:20am PDT: majority of users experienced a partial query failures while we rolled things

But:

No existing data was impacted

Highlights of what we're doing in response:

Planning a Kafka upgrade!
Better instrumentation for Kafka and ZooKeeper into Honeycomb
Change the way our data nodes handle invariants to deal with out-of-order offsets

As always, let us know if you have questions -- support@honeycomb.io

Posted Oct 18, 2017 - 17:20 PDT

Resolved

Issue is resolved, but we're still watching it in case there are any aftershocks. Full post mortem will go out later today!

Posted Oct 17, 2017 - 12:53 PDT

Monitoring

Fixes have been applied and we are monitoring the situation to verify full resolution.

Posted Oct 17, 2017 - 11:57 PDT

Identified

The problem has been identified as a bug in Kafka controller behavior. Resolution under way!

Posted Oct 17, 2017 - 10:25 PDT

Investigating

We are investigating a state-of-the-world-agreement failure one of the components of our API service that is impacting ingest for some (but not all) Honeycomb users as of 8am PDT.

Posted Oct 17, 2017 - 09:16 PDT