Querying and Ingest issues in EU

Incident Report for Honeycomb

Postmortem

This incident started on December 5th and is one of the longest in Honeycomb history: it was actively worked on until it was closed on December 17th. Due to its impact and duration, we wanted to offer a partial and preliminary report to explain, at a high level, what happened.

On December 5th, at 20:23 UTC, our Kafka cluster suffered a critical loss of redundancy. Our Kafka cluster contains multiple topics, including all telemetry events submitted by Honeycomb users, the rematerialization of state changes into the activity log, and multiple metadata topics used by Kafka to manage its own workloads. This left multiple partitions leaderless, and by 20:35 UTC, we were getting alerts that roughly a quarter of our usual event topic partitions were unable to accept writes.
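
As an illustration of what "leaderless" means operationally: a Kafka partition with no elected leader cannot accept produce requests at all, which is why a quarter of the event partitions stopped taking writes. The sketch below shows one way such partitions show up in Kafka's standard admin metadata, using the confluent-kafka Python client; the broker address is an assumed placeholder, not our actual configuration.

```python
# A rough sketch (not Honeycomb's tooling) of how leaderless partitions surface in
# Kafka's admin metadata: a partition whose leader is -1 cannot accept writes.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka.eu1.internal:9092"})  # placeholder
metadata = admin.list_topics(timeout=10)

for name, topic in metadata.topics.items():
    leaderless = [p.id for p in topic.partitions.values() if p.leader == -1]
    if leaderless:
        # These partitions reject produce requests until a new leader is elected.
        print(f"{name}: leaderless partitions {leaderless}")
```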

For most Honeycomb customers, this does not result in ingest outages, because traffic gets redirected to other partitions that their environments can be assigned to. However, some teams had all of their assigned partitions within the impacted set, and for them all ingest was down (0.23% of all datasets were impacted). By 21:30 UTC, we had identified the impacted teams, were working on a traffic reassignment script, and thought other features (such as SLO evaluation) were unimpacted.

However, at 22:08 UTC, we noticed, through the noise of all the alerts that had been going off, that our ingestion fleet had been in overload protection since 20:38 UTC, impacting everyone more broadly by returning errors on api.eu1.honeycomb.io that would have led to dropped events. Stabilization work was done, and 6 minutes later (22:14 UTC) ingestion was stabilized, yielding roughly 1h25 of increased ingest error rates on api.eu1.honeycomb.io for everyone.

At 01:30 UTC on Saturday, December 6th, our responders had managed to force new leader elections on all impacted ingestion partitions, which re-established working traffic internally on all but one of them; that last partition remained shuttered in read-only mode. By then, our brokers were severely imbalanced. We attempted to tweak retention settings and monitored disk usage, planning to come back during the day to repair the remaining broken partitions.

By 11:00 UTC, disk usage on our Kafka brokers had reached a threshold where it became necessary to devise and perform emergency operations. We noticed that our metadata partitions, some of which track Kafka's offloading of storage to S3 and its autobalancing features, were still leaderless, and we thought this could be the problem.

In fixing them, we also repaired consumer group metadata topics, which revealed, at 14:27 UTC, that our SLO product had been partially stuck since the start of the outage. Because the consumer group topic partitions were broken, our event consumers for SLOs had been working fine on some partitions but were fully stalled on others, despite reporting as healthy: they were idling as "online" without seeing that they were late. Once fixed, they caught up, and by 19:15 UTC they were backfilled and back to normal.
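
A sketch of the failure mode, with assumed broker, group, and topic names: liveness alone ("the consumer is connected and heartbeating") does not reveal a stalled partition, but comparing each partition's committed offset against the broker's high watermark does. This is not Honeycomb's monitoring code, just an illustration using the confluent-kafka Python client.

```python
# Illustration only: a consumer group can look "online" while some partitions make
# no progress. Per-partition committed offset vs. high watermark exposes the stall.
from confluent_kafka import Consumer, TopicPartition

c = Consumer({
    "bootstrap.servers": "kafka.eu1.internal:9092",  # placeholder
    "group.id": "slo-evaluator",                     # placeholder
    "enable.auto.commit": False,
})
topic = "telemetry-events"                           # placeholder

meta = c.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in meta.topics[topic].partitions]

for tp, committed in zip(partitions, c.committed(partitions, timeout=10)):
    low, high = c.get_watermark_offsets(tp, timeout=10)
    lag = (high - committed.offset) if committed.offset >= 0 else (high - low)
    # Alerting on per-partition lag, not just member liveness, catches a consumer
    # that is idling as "online" without realizing how far behind it is.
    print(f"partition {tp.partition}: committed={committed.offset} high={high} lag={lag}")
c.close()
```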

However, before then, the issue with Kafka disk storage got worse. We feared that with full disks, the entirety of our Kafka cluster would reach an irreparable state (where the only way to free space is to perform dangerous, untested operations), so at 15:09 UTC, with less than 5% of disk space left, we chose to turn off ingest entirely. Ingest would have turned itself off anyway if the disks filled up, so we elected to keep recovery simpler by cutting traffic off a few minutes ahead of time.

We quickly scrambled to add new Kafka brokers and partitions, which wouldn't be full, so we could shift traffic onto them. But before we were done, we thought of disabling our tiered storage. Our Kafka brokers store only a few hours of data locally ("hotset" data), and tier out a longer retention period (2-3 days) to S3. All data written to Kafka is quickly replicated to our storage engine, and the longer retention is only kept for disaster recovery. We had spent many hours trying to repair topics and reduce the size of our hotset, but nothing had the desired effect. Our theory was that tiering was broken anyway, and that we were stuck waiting for an offload of data that would never happen.
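
For readers unfamiliar with tiered storage, the sketch below illustrates the retention split described above using Apache Kafka's open-source (KIP-405) topic configuration names; our actual tiering implementation, topic names, and retention values may differ.

```python
# Hedged illustration of the hotset/tiered retention split. Brokers must also enable
# remote log storage (remote.log.storage.system.enable=true) for these topic settings
# to take effect; all names and values here are illustrative.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka.eu1.internal:9092"})  # placeholder

topic = NewTopic(
    "telemetry-events",          # placeholder
    num_partitions=60,           # illustrative
    replication_factor=3,        # illustrative
    config={
        "remote.storage.enable": "true",                # tier older segments to object storage
        "local.retention.ms": str(6 * 60 * 60 * 1000),  # keep only a ~6h "hotset" on local disk
        "retention.ms": str(3 * 24 * 60 * 60 * 1000),   # ~3 days total retention, mostly remote
    },
)
admin.create_topics([topic])
```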

At 21:32 UTC, we disabled tiered storage altogether, and most of the disk space was recovered. We abandoned the cluster expansion, which was minutes from completion, and at 21:49 UTC, ingestion of customer data was turned back on. In total, we were unable to accept traffic for 6 hours and 23 minutes.

Our responders stayed online cleaning up issues until roughly 5 am UTC on December 7, then disbanded to come back during the day. Sunday was mostly spent stabilizing the emergency configuration changes that had been made, and the response team, who had worked around the clock since Friday night, was able to get some rest while our EU cluster was mostly healthy again, aside from the activity log and one of our storage partitions that was still unavailable for querying.

On Monday, December 8, employees who weren't on call over the weekend took over stabilization work. Efforts went toward restoring querying for the roughly 1/40th of our storage data affected by the partial querying outage, and they succeeded. No significant progress was made on improving our activity log feature's storage; instead, the team increased the retention of its changesets to 7 days (the maximum allowed in our storage engine). Our thinking was that once the Kafka cluster was stable and these partitions were fixed, we could simply replay events and insert everything back.

Tuesday, December 9 was spent trying to further stabilize our Kafka cluster and turn some features back on, but we made little progress in salvaging partitions. We started corrective work while our Kafka experts looked into what could be done with the many still-damaged metadata and internal topics, which were less critical to staying up but still very important.

On December 10, knowing we only had a few days to salvage Activity Log data, we decided to stop trying to save its Kafka topics and to instead recreate them. At 19:25 UTC, we found that deletion operations failed, and that our Kafka cluster's control plane no longer let us do any manipulation whatsoever aside from listing topics. We couldn't delete, create, describe, or mutate any of them. The topics were still fine at accepting and transmitting events, but we were essentially unable to administer the cluster anymore.
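
A sketch of the kind of probe that confirms this state (placeholder names and addresses, not our actual tooling): listing topics only needs broker metadata and keeps working, while create, delete, and describe operations go through the cluster's control plane and never complete.

```python
# Probe the admin plane: listing succeeds, but controller-backed operations fail.
from confluent_kafka.admin import AdminClient, ConfigResource, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka.eu1.internal:9092"})  # placeholder

topics = sorted(admin.list_topics(timeout=10).topics)
print(f"listing still works: {len(topics)} topics visible")

probe = "control-plane-probe"  # hypothetical scratch topic
attempts = {
    "describe": admin.describe_configs([ConfigResource(ConfigResource.Type.TOPIC, topics[0])]),
    "create": admin.create_topics([NewTopic(probe, num_partitions=1, replication_factor=1)]),
    "delete": admin.delete_topics([probe]),
}
for op, futures in attempts.items():
    for resource, fut in futures.items():
        try:
            fut.result(timeout=30)
            print(f"{op} {resource}: ok")
        except Exception as exc:  # in the damaged cluster, every one of these failed
            print(f"{op} {resource}: failed ({exc})")
```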

What we realized at that point was that our initial cluster outage had severely damaged multiple internal partitions used to manage the cluster itself; as we cut off tiered storage (and as data aged out as well), the damage became more or less irreversible.

We then considered our chances of salvaging the cluster to be rather low, and feared that most small changes (such as assigning a new controller) could collapse the cluster and result in a large outage with data loss. Internal details of some of our applications are relevant here: our storage engine is tightly coupled to Kafka's own event offsets. To prevent data consistency issues, it will refuse to "roll back" to an older offset, which could indicate a misconfiguration or some other problem that could lead to re-reading, duplicating, or losing data. As such, just "resetting" the cluster was not doable at that point without adding a lot of infrastructure or writing emergency fixes that significantly changed our storage engine's boot sequence and safety checks.
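
To make the safety check concrete, here is a minimal sketch (not our storage engine's actual code) of a per-partition cursor that refuses to resume from an offset older than what it has already applied. A brand-new cluster starts its offsets back at zero, which is exactly the kind of "rollback" this guard rejects.

```python
# Minimal sketch of the offset safety check described above; illustrative only.
class OffsetRollbackError(RuntimeError):
    pass


class PartitionCursor:
    """Tracks the next Kafka offset a storage partition expects to consume."""

    def __init__(self, partition: int, next_offset: int):
        self.partition = partition
        self.next_offset = next_offset

    def resume_from(self, broker_offset: int) -> int:
        # An offset behind our checkpoint means the cluster "went back in time"
        # (for example, a reset or a rebuilt topic); refusing to continue avoids
        # silently re-reading, duplicating, or losing data we already persisted.
        if broker_offset < self.next_offset:
            raise OffsetRollbackError(
                f"partition {self.partition}: broker offers offset {broker_offset}, "
                f"but offset {self.next_offset} has already been applied"
            )
        return broker_offset


cursor = PartitionCursor(partition=12, next_offset=981_422)
cursor.resume_from(981_422)  # fine: resume exactly where we left off
# cursor.resume_from(0)      # would raise: a brand-new cluster starts back at offset 0
```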

Later, still on December 10, we decided to split efforts into a) continuing to try to save our cluster, to the extent that doing so didn't risk its stability, and b) starting an emergency migration project that required figuring out how to modify our storage engine, infrastructure, and multiple services to tolerate moving from the damaged Kafka cluster to a new one. We also prepared for multiple contingencies in case the current cluster were to die before the migration was ready to run.

December 11 and December 12 were spent working on this migration at high priority. Meanwhile, our low-risk efforts to salvage the Kafka cluster yielded no great results. As we neared the end of Friday, December 12, the retention window for the Activity Log's data also came to an end. Faced with a choice between causing a major outage by rushing an evacuation of our Kafka cluster before we were ready and losing days of Activity Log data while waiting for a safer migration path, we chose the latter.

On Monday, December 15, we managed to boot a new Kafka cluster on new infrastructure, and had all the fixes required for every service to do the migration, along with runbooks from every team involved. We decided to do a "dress rehearsal" in a pre-production environment, where we tried a full evacuation from a functional Kafka cluster to a brand new one, finding out what the tricky parts were and making sure that if we were to damage or lose data, it would be internal telemetry, not customer data.

Fortunately, everything went well, and on Tuesday, December 16, we ran the full emergency evacuation. Running it required switching all ingest over from the old, damaged Kafka cluster to the new one. We started by switching the components that write data into the cluster (producers), letting the consumers catch up on all topics and partitions still in the old cluster. We then migrated our consuming services, first the query engine, and then the other consumers. This order of operations ensured that we would not corrupt, damage, or miss any data, but it did mean delays in alerting and in the freshness of query data. The migration happened without major issues.
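
A simplified sketch of that ordering, with placeholder cluster addresses, topic, and group names (the real migration involved many services and runbooks): producers move to the new cluster first, consumers keep draining the old cluster until nothing remains, and only then are consumers re-pointed at the new cluster.

```python
# Illustrative cutover ordering: producers first, then drain, then consumers.
import time

from confluent_kafka import Consumer, Producer, TopicPartition

OLD_BOOTSTRAP = "old-kafka.eu1.internal:9092"  # placeholder
NEW_BOOTSTRAP = "new-kafka.eu1.internal:9092"  # placeholder
TOPIC = "telemetry-events"                     # placeholder
GROUP = "query-engine"                         # placeholder


def remaining_events(bootstrap: str, group: str, topic: str) -> int:
    """Total committed-offset lag for one group on one topic (0 means fully drained)."""
    c = Consumer({"bootstrap.servers": bootstrap, "group.id": group,
                  "enable.auto.commit": False})
    meta = c.list_topics(topic, timeout=10)
    parts = [TopicPartition(topic, p) for p in meta.topics[topic].partitions]
    lag = 0
    for tp, committed in zip(parts, c.committed(parts, timeout=10)):
        _, high = c.get_watermark_offsets(tp, timeout=10)
        if committed.offset >= 0:
            lag += max(0, high - committed.offset)
    c.close()
    return lag


# Step 1: new writes go to the new cluster only.
producer = Producer({"bootstrap.servers": NEW_BOOTSTRAP})

# Step 2: wait until consumers have read everything still sitting in the old cluster.
while remaining_events(OLD_BOOTSTRAP, GROUP, TOPIC) > 0:
    time.sleep(30)

# Step 3: only now re-point consumers (query engine first, then the rest) at the new cluster.
consumer = Consumer({"bootstrap.servers": NEW_BOOTSTRAP, "group.id": GROUP,
                     "auto.offset.reset": "earliest"})
consumer.subscribe([TOPIC])
```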

During the migration, we believe our trigger data was stale and triggers ran on incomplete data for roughly 30 minutes, averaged across most partitions. SLOs were delayed by a bit more than an hour, and service maps will have skipped that hour outright. Activity Log data was lost between December 5 at 20:23 UTC and December 9 at 23:45 UTC.

These impacts come on top of the 10 days during which Activity Log data was unavailable, the ingestion issues of December 5, the 6 hours and 23 minutes of full ingest outage on December 6, and the 18 hours of delayed SLO processing between December 5 and 6.

At this point, we have fully mitigated this incident and developed new processes that should leave us better prepared to respond if similar outages happen again. A more in-depth review will be published in a few weeks, in January. This has been a significant outage, and we will need some time to do a proper analysis of it.

Posted Dec 18, 2025 - 11:11 PST

Resolved

As of 23:23 UTC on December 16, the Activity Log has fully caught up and has remained up to date.

All systems are now operational.
Posted Dec 17, 2025 - 07:47 PST

Monitoring

We successfully created and migrated over to a new Kafka cluster in the EU. We have resumed consuming events for our SLO and Service Maps features. No further data loss is expected for Service Maps. SLOs are caught up and fully synchronized. Activity Logs are still delayed and are expected to be caught up in about 5 hours. We consider this incident to be mitigated, but will continue to monitor until Activity Logs are caught up.
Posted Dec 16, 2025 - 12:17 PST

Update

As of today, our old EU cluster is still functional, yet in a bad state. We do not consider this outage over. However, after running a full evacuation in a pre-production cluster, we are now ready to run a production version of it. This will only affect the impacted EU cluster. We are planning to do it at 6pm UTC on December 16th. We expect no ingestion downtime, although we do expect delays in processing events and alerts; we will suffer minor data loss in our Service Maps feature and lose 11 days' worth of Activity Log events starting the night of December 5th.

We will provide updates throughout the migration.
Posted Dec 15, 2025 - 17:52 PST

Update

We continue to work toward restoring the Activity Log. We’ve made meaningful progress on the underlying issues, but have had to pivot on our remediation plan due to previously unknown hard retention limits. Our engineering team is working through the next steps required to restore Activity Log processing. Next updates will come during Monday US business hours.
Posted Dec 12, 2025 - 15:20 PST

Update

We continue to work toward restoring the Activity Log. We’ve made meaningful progress on the underlying issues, but have had to pivot on our remediation plan due to previously unknown hard retention limits. Our engineering team is actively working through the next steps required to restore Activity Log processing.
Posted Dec 11, 2025 - 16:09 PST

Update

We continue to work toward fully restoring the Activity Log. We’ve made meaningful progress on the underlying issues and continue to work through our remediation plan. Our engineering team is actively working through the next steps required to restore Activity Log processing.
Posted Dec 11, 2025 - 10:02 PST

Update

We continue to work toward fully restoring the Activity Log. While functionality has not yet improved, we’ve made progress on the underlying issue and have begun the next phase of remediation. Our engineering team is pausing for the evening and will resume working through the remaining steps required to resume Activity Log processing during US business hours tomorrow. We will provide another update tomorrow morning.
Posted Dec 10, 2025 - 17:08 PST

Update

We continue to work toward fully restoring the Activity Log. While functionality has not yet improved, we’ve made progress on the underlying issue and have begun the next phase of remediation. Our engineering team is actively working through the remaining steps required to resume Activity Log processing.
Posted Dec 10, 2025 - 09:34 PST

Update

We continue to work toward fully restoring the Activity Log. While functionality has not yet improved, we have a clear path to recovery, and remediation work is underway. Our next steps involve coordinated infrastructure changes that will allow us to resume Activity Log processing. We’ll provide another update during US business hours Wednesday.
Posted Dec 09, 2025 - 14:40 PST

Update

We are continuing to work on a fix for this issue.
Posted Dec 08, 2025 - 18:05 PST

Update

We fixed the issue that was causing subsets of data to be temporarily missing from queries against certain partitions. We continue to work on restoring full functionality to the Activity Log. We'll provide another update during US business hours Tuesday.
Posted Dec 08, 2025 - 18:01 PST

Update

We continue to work on restoring full functionality to the Activity Log and on repairing the partition that is causing certain customers to have subsets of unqueryable data. We will next provide an update during US business hours Tuesday.
Posted Dec 08, 2025 - 17:02 PST

Update

The Activity Log is operational but is experiencing ingestion delays, which means recent activity may not be reflected immediately. We are working to restore full functionality.
Currently, a small subset of data for certain customers may not be queryable due to a partition needing repair. We are working to return the data to queryability.
After investigating the impact to SLO and Trigger functionality that started on Friday, we're providing the update below with the details.

Overview
SLOs and Triggers were affected by the Kafka and ingest outages beginning Friday, December 5 at 8:40 PM UTC. Trigger evaluations are now behaving normally except for some baseline triggers. SLOs whose evaluation periods include missing ingest, or which are on the affected partition, will reflect that currently unqueryable data in their ongoing calculations.
SLOs
- No burn alerts were fired between December 5, 2025 at 8:40 PM UTC and December 6, 2025 at 7:15 PM UTC due to a processing delay.
- Now that ingest is restored, burn alerts should behave normally again, except for a subset that is also impacted by the small amount of currently unqueryable data.
Triggers
- When ingest was down on December 6, no events were received, so all EU Trigger evaluations were impacted.
- Baseline Triggers with lookback windows that query data in the affected partition will also be affected by the small amount of currently unqueryable data.
- Non-baseline triggers (which can only look back 24 hours) should behave normally going forward.

We will provide an update at the end of the workday (5PM PT) or as the situation changes.
Posted Dec 08, 2025 - 13:52 PST

Update

Our Kafka cluster is still operational and we are continuing to work towards restoring full reliability. We will resume work during business hours to restore full Activity Log functionality. We will continue to update as work progresses.
Posted Dec 07, 2025 - 12:23 PST

Update

The Kafka cluster is still operational, but is not back to full capacity and resiliency. Work will continue tomorrow to ensure sufficient capacity, as well as during business hours to resume full Activity Log functionality. We will provide more updates tomorrow and as-needed if there are any changes to availability.
Posted Dec 06, 2025 - 18:18 PST

Update

We have gotten the Kafka cluster to a stable-enough state; however, it will still require remediation before the Activity Log is fully functional. The cluster is back to a level of service that is sustainable until business hours. We will provide another update in 2 hours or as the situation changes.
Posted Dec 06, 2025 - 16:10 PST

Update

Ingest has been fully re-enabled, and SLOs and Triggers are now up-to-date.
Posted Dec 06, 2025 - 14:08 PST

Identified

We have resumed ingest to api.eu1.honeycomb.io as we were able to restore partial service to the Kafka cluster.
Posted Dec 06, 2025 - 13:51 PST

Update

We are continuing to investigate issues with the Kafka cluster. We will continue providing updates every 2 hours unless there are significant changes.
Posted Dec 06, 2025 - 13:23 PST

Update

We are continuing to investigate issues with the Kafka cluster.
Posted Dec 06, 2025 - 11:20 PST

Update

External API access is restored; event ingestion is still impacted.
Posted Dec 06, 2025 - 08:15 PST

Update

We have temporarily disabled event ingestion for the EU Region, in service of restoring full functionality to our EU Kafka fleet. External API access will also be disabled. Additionally, during the ongoing outage, our Service Level Objectives feature has been impacted and down since 12:30 PM Pacific time on Friday, December 5th. SLO data will not be correct until our systems catch up and we rebuild the cache.
Posted Dec 06, 2025 - 07:19 PST

Update

Automated systems are still working to catch up, and the Activity Log will remain offline until that completes. We will post another update Saturday whether or not the Activity Log outage has been remediated.
Posted Dec 05, 2025 - 23:30 PST

Update

Ingest, querying, SLOs, and Trigger alerting are back to normal. The Activity Log is still impacted; we have identified the cause and are working to resolve it. We are still evaluating the scope of the outage for ingest and expect to have a full answer posted during US business hours on Monday.
We will post at least one more update this evening.
Posted Dec 05, 2025 - 16:04 PST

Update

We are continuing the investigation after business hours to stabilize ingest and determine what work will be needed to fully recover. Known impact will be updated here as the situation changes. We will post at least one more update this evening.
Posted Dec 05, 2025 - 15:46 PST

Update

We have identified that 0.23% of datasets are fully affected, and a larger percentage are seeing intermittent ingestion and query failures (500s at the API level). We are also investigating a replication error in our ingestion pipeline.
Posted Dec 05, 2025 - 14:33 PST

Update

A subset of customer environments may see higher than usual error rates when sending events to api.eu1.honeycomb.io, and notifications for SLOs and Triggers may be delayed for that subset. We are continuing to investigate the issue.
Posted Dec 05, 2025 - 13:19 PST

Investigating

We are currently investigating this issue.
Posted Dec 05, 2025 - 12:49 PST
This incident affected: api.eu1.honeycomb.io - EU1 Event Ingest, ui.eu1.honeycomb.io - EU1 Querying, ui.eu1.honeycomb.io - EU1 Trigger & SLO Alerting, and ui.eu1.honeycomb.io - EU1 Activity Log.