Querying Errors
Incident Report for Honeycomb
On September 28, we received alerts that two of the query instance hosts on one of our partitions had died. Because they died in close succession, we initially suspected a problem with some of the data being processed. We tried restarting the process, which failed, and then tried replacing one of the hosts, which also failed.
We then suspected corrupted or unprocessable data in our Kafka stream, and tried various strategies to identify it or restore service. During that time, we made the affected partition read-only and redirected write traffic to our healthy partitions. Before the partition was marked read-only, around 3% of datasets overall (including Honeycomb internal datasets) may have seen delayed data between September 28 at 16:17 UTC and 17:35 UTC (just over 1 hour). No data was lost; it was only delayed in processing.
All customer datasets are balanced across at least 3 partitions, and gain more with higher ingest volume, so the share of delayed data for that time period may have ranged from 30% on smaller datasets to less than 0.5% on larger ones.
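To illustrate how the delayed share shrinks as a dataset gains partitions, here is a minimal sketch. It assumes writes are spread evenly across a dataset's partitions, which this report does not state, and the function name and the partition counts other than 3 are hypothetical.

```python
def delayed_share(partition_count: int, affected_partitions: int = 1) -> float:
    """Fraction of a dataset's writes that land on the affected partition(s).

    Assumption (not from the report): writes are balanced evenly across partitions.
    """
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    if not 0 <= affected_partitions <= partition_count:
        raise ValueError("affected_partitions must be between 0 and partition_count")
    return affected_partitions / partition_count


if __name__ == "__main__":
    # 3 is the minimum named in the report; 10 and 200 are illustrative only.
    for n in (3, 10, 200):
        print(f"{n} partitions: {delayed_share(n):.1%} of writes on one partition")
```

Under that even-balancing assumption, a dataset on the minimum of 3 partitions has roughly a third of its writes on any single partition, while a dataset would need on the order of 200 partitions for the share to reach 0.5%.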
We tried restoring from older snapshots in case bad data had corrupted the on-disk data on the remaining hosts, until we realized that the process was failing to start because it had reached the maximum number of open files allowed by the system. That limit was relatively high when it was set, but the traffic each partition receives, combined with the number of columns (and therefore files) that the application holds open, is much higher today than it was then. To ensure no other partitions were affected, we hot-patched every running process with a higher limit, around 10x the previous value. We waited to re-enable traffic to the affected partition, however, to give our engineers time to thoroughly test a permanent fix that applies the limit to all new instances at boot.
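For readers unfamiliar with open-file limits, the following is a minimal sketch of inspecting and raising the limit from within a process on a Unix-like system, using Python's standard `resource` module. It is not how Honeycomb applied its patch, which this report does not describe; the function name and the use of a 10x factor as a default are illustrative assumptions.

```python
import resource
from typing import Tuple


def raise_soft_nofile(factor: int = 10) -> Tuple[int, int]:
    """Multiply this process's soft open-file limit by `factor`, capped at the hard limit.

    Returns (old_soft, new_soft). Hypothetical helper for illustration only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, soft  # already unlimited, nothing to raise

    target = soft * factor
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process may not raise the soft limit past the hard limit.
        target = min(target, hard)

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # The kernel rejected the new value; leave the limit unchanged.
        return soft, soft
    return soft, target


if __name__ == "__main__":
    old, new = raise_soft_nofile()
    print(f"soft open-file limit: {old} -> {new}")
```

On Linux, `resource.prlimit(pid, resource.RLIMIT_NOFILE, (soft, hard))` can change the limit of another process that is already running, which is closer in spirit to the hot-patch described above, though it requires appropriate privileges.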
As of today (September 30), we have re-enabled write traffic to the affected partition and have confirmed that the higher limit is applied to new instances. We've also added a fleet-wide signal to our alerting pipeline that tracks the number of open files relative to the limit.
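A signal of this kind can be sketched as the ratio of a process's open file descriptors to its soft limit. The example below is an assumption about what such a check might look like, not Honeycomb's implementation; the function names and the 0.8 alert threshold are hypothetical. It relies on a per-process descriptor directory (`/proc/self/fd` on Linux, `/dev/fd` on some other Unix-like systems).

```python
import os
import resource


def open_fd_count() -> int:
    """Count this process's open file descriptors via the fd directory.

    The listing itself briefly holds one descriptor, so the count may be off by one.
    """
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    raise OSError("no per-process fd directory found on this platform")


def fd_usage_ratio() -> float:
    """Open descriptors divided by the soft limit; 0.0 if the limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft <= 0:
        return 0.0
    return open_fd_count() / soft


def should_alert(ratio: float, threshold: float = 0.8) -> bool:
    """True when usage is at or above the (hypothetical) alert threshold."""
    return ratio >= threshold


if __name__ == "__main__":
    ratio = fd_usage_ratio()
    print(f"open files at {ratio:.1%} of limit; alert={should_alert(ratio)}")
```

Alerting on the ratio rather than the raw count keeps the signal meaningful when the limit itself changes, as it did during this incident.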
Posted Sep 30, 2022 - 09:59 PDT
We've confirmed that access to the data and query performance have been restored, but have blocked deployments to the affected service until we return to full capacity.
Posted Sep 28, 2022 - 15:27 PDT
We estimate it will take 2 hours for the cluster to catch up, after which the affected data will be fully restored. This affects a small subset of queries in certain datasets.
Posted Sep 28, 2022 - 12:57 PDT
The fix has been rolled out to ensure no other data partitions are affected, but we're mitigating the ripple effects of the initial cause before we can re-enable access to the affected data.
Posted Sep 28, 2022 - 12:47 PDT
The issue has been identified and a potential fix is being tested and rolled out to the fleet.
Posted Sep 28, 2022 - 11:47 PDT
We have restored querying, but customers may see some missing data from a subset of queries. We are actively working on restoring the ability to query that data.
Posted Sep 28, 2022 - 11:05 PDT
We are continuing to investigate, and are actively working to identify a fix.
Posted Sep 28, 2022 - 10:55 PDT
We are continuing to investigate, and are actively working to identify a fix.
Posted Sep 28, 2022 - 10:17 PDT
We are investigating elevated error rates with queries for some customers.
Posted Sep 28, 2022 - 09:51 PDT
This incident affected: ui.honeycomb.io - US1 Querying and ui.honeycomb.io - US1 Trigger & SLO Alerting.