API service partition
Incident Report for Honeycomb
Postmortem

We ran into a bug in Kafka controller identification/election (brains were split) on Oct 17 which caused a partial outage of our ingest service. TLDR below or read more in our blog post!

What happened:

  • Kafka (part of our ingest pipeline) suffered a split brain problem because of a bug in versions previous to 0.10.2.1

Impact:

  • 6:03-10:45am PDT: Some (33%, actually) teams actively sending data had a partial loss of writes--of those, most lost <50% of writes
  • 10:50-11:20am PDT: majority of users experienced a partial query failures while we rolled things

But:

  • No existing data was impacted

Highlights of what we're doing in response:

  • Planning a Kafka upgrade!
  • Better instrumentation for Kafka and ZooKeeper into Honeycomb
  • Change the way our data nodes handle invariants to deal with out-of-order offsets

As always, let us know if you have questions -- support@honeycomb.io

<3

Posted Oct 18, 2017 - 17:20 PDT

Resolved
Issue is resolved, but we're still watching it in case there are any aftershocks. Full post mortem will go out later today!
Posted Oct 17, 2017 - 12:53 PDT
Monitoring
Fixes have been applied and we are monitoring the situation to verify full resolution.
Posted Oct 17, 2017 - 11:57 PDT
Identified
The problem has been identified as a bug in Kafka controller behavior. Resolution under way!
Posted Oct 17, 2017 - 10:25 PDT
Investigating
We are investigating a state-of-the-world-agreement failure one of the components of our API service that is impacting ingest for some (but not all) Honeycomb users as of 8am PDT.
Posted Oct 17, 2017 - 09:16 PDT