On Oct 17 we ran into a bug in Kafka controller identification/election (a split brain) that caused a partial outage of our ingest service. TL;DR below, or read more in our blog post!
What happened:
- Kafka (part of our ingest pipeline) suffered a split-brain problem because of a bug in versions prior to 0.10.2.1
Impact:
- 6:03-10:45am PDT: 33% of teams actively sending data had a partial loss of writes; of those, most lost <50% of their writes
- 10:50-11:20am PDT: The majority of users experienced partial query failures while we rolled things
But:
- No existing data was impacted
Highlights of what we're doing in response:
- Planning a Kafka upgrade!
- Adding better instrumentation for Kafka and ZooKeeper, sent into Honeycomb
- Changing the way our data nodes handle invariants so they can deal with out-of-order offsets
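
For the curious, here is a rough sketch of what "dealing with out-of-order offsets" could look like in a consumer. This is illustrative only and not our actual data node code: `OffsetTracker`, `accept`, and the skip-and-record policy are all hypothetical names and choices made for this example. The idea is to treat an offset that goes backwards as something to record and skip, rather than as a broken invariant that stops the consumer.

```python
from dataclasses import dataclass, field


@dataclass
class OffsetTracker:
    """Hypothetical helper: remembers the highest offset applied per partition."""
    last_applied: dict = field(default_factory=dict)
    rejected: list = field(default_factory=list)

    def accept(self, partition: int, offset: int) -> bool:
        """Return True if the message should be applied, False if it is skipped."""
        last = self.last_applied.get(partition)
        if last is not None and offset <= last:
            # Offset repeated or went backwards: record it and skip
            # instead of failing a hard invariant check.
            self.rejected.append((partition, offset, last))
            return False
        self.last_applied[partition] = offset
        return True


tracker = OffsetTracker()
# Offsets 10, 11, 9, 12 arrive on partition 0; 9 is out of order.
results = [tracker.accept(0, offset) for offset in (10, 11, 9, 12)]
```

Whether to skip, buffer, or alert on a rejected offset depends on the system; skipping is just the simplest policy to show here.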
As always, let us know if you have questions -- support@honeycomb.io
<3