All Systems Operational
api.honeycomb.io - Event Ingest: Operational (99.7% uptime over the past 90 days)
ui.honeycomb.io - App Interface: Operational
ui.honeycomb.io - Querying: Operational (99.89% uptime over the past 90 days)
ui.honeycomb.io - Trigger & SLO Alerting: Operational (99.79% uptime over the past 90 days)
www.honeycomb.io: Operational
Past Incidents
Oct 4, 2022

No incidents reported today.

Oct 3, 2022
Resolved - This incident has been resolved.
Oct 3, 12:50 PDT
Update - Our systems appear to have stabilized in the short term, and we are monitoring the issue until we return to full capacity. We're also investigating long-term improvements to increase the reliability of the affected services under appreciable load.
Oct 3, 12:39 PDT
Update - SLOs and trigger evaluations are up to date, but we are continuing to monitor excess load on the database and are still stabilizing the system.
Oct 3, 12:04 PDT
Monitoring - We are preparing to run a backfill on SLO notifications for the affected time period. Any SLO burn alerts that would otherwise have triggered during this period may trigger during the backfill.
Oct 3, 11:13 PDT
Update - Trigger evaluations and notifications are functional again, but SLO reporting is still delayed.
Oct 3, 10:26 PDT
Identified - We have disabled SLO evaluations in order to catch up on trigger evaluations; this has had an initial positive effect.
Oct 3, 10:23 PDT
Update - Trigger evaluations, notifications, and alerts appear to all be impacted and may not be running.
Oct 3, 10:05 PDT
Investigating - We are currently investigating a delay in the evaluation of SLOs and trigger execution.
Oct 3, 09:52 PDT
Oct 2, 2022

No incidents reported.

Oct 1, 2022

No incidents reported.

Sep 30, 2022
Resolved - On September 28, we received alerts that two of our query instance hosts on one of our partitions had died. Since they died shortly after one another, we initially suspected a problem with some data that needed processing. We tried restarting the process, which failed, and then tried replacing one of the hosts, which also failed.
We then thought the issue could have been caused by corrupted or unprocessable data in our Kafka stream, and tried various strategies to identify it or restore service. During that time, we made the affected partition read-only and redirected its write traffic to our healthy partitions. Before the partition was marked read-only, around 3% of datasets overall (including Honeycomb internal datasets) may have seen delayed data between September 28 at 16:17 UTC and 17:35 UTC (just over 1 hour). None of the data was lost; it was simply delayed for processing.
All customer datasets are balanced across at least 3 partitions, and gain more partitions with higher ingest volume, so the share of data delayed for that time period may have ranged from about 30% for smaller datasets to less than 0.5% for larger ones.
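As a rough illustration of those percentages (our own back-of-the-envelope arithmetic, assuming writes are spread roughly evenly): if a dataset's writes span N partitions and one partition is read-only, about 1/N of that dataset's incoming data is delayed during the window, so a small dataset on the 3-partition minimum sees on the order of 1/3 (about 30%), while a large dataset spread across 200 or more partitions sees 1/200 (0.5%) or less.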
We tried restoring from older snapshots in case bad data had corrupted the disk data on the remaining hosts, until we realized that the process was failing to start because it had reached the maximum number of open files allowed by the system. That limit had been set to a relatively high value for its time; however, the traffic that each partition receives today, combined with the number of columns (and therefore files) that the application holds open, is much higher than when the limit was set. We hot-patched every running process with a higher limit - around 10x what it was before - to ensure no other partitions were affected, but we waited to re-enable traffic to the affected partition in order to give our engineers time to thoroughly test a permanent fix that applies to all new instances at boot.
As of today (September 30), we have re-enabled write traffic to the affected partition and have confirmed that the limit is applied to new instances. We’ve also added a signal to our alerting pipeline across the fleet for the number of open files as compared to the limit.
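For illustration only - this is a minimal sketch of the kind of check described above, not our actual alerting pipeline code - a process on Linux can periodically compare the number of file descriptors it has open against its RLIMIT_NOFILE soft limit and report the ratio, so an alert can fire well before the limit is reached:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// countOpenFDs returns the number of file descriptors the current
// process has open, by listing /proc/self/fd (Linux only).
func countOpenFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	for {
		// Read the current soft limit on open files for this process.
		var lim syscall.Rlimit
		if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
			fmt.Fprintln(os.Stderr, "getrlimit:", err)
			os.Exit(1)
		}
		open, err := countOpenFDs()
		if err != nil {
			fmt.Fprintln(os.Stderr, "count fds:", err)
			os.Exit(1)
		}
		// Emit open count, limit, and utilization; in practice this would be
		// sent to a metrics or alerting pipeline rather than printed.
		fmt.Printf("open_fds=%d soft_limit=%d utilization=%.2f\n",
			open, lim.Cur, float64(open)/float64(lim.Cur))
		time.Sleep(60 * time.Second)
	}
}
```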

Sep 30, 09:59 PDT
Update - We've confirmed that access to the data and query performance have been restored, but we have blocked deployments to the affected service until we return to full capacity.
Sep 28, 15:27 PDT
Update - We estimate it will take 2 hours for the cluster to catch up, after which the affected data will be fully restored. This affects a small subset of queries in certain datasets.
Sep 28, 12:57 PDT
Update - The fix has been rolled out to ensure no other data partitions are affected, but we're mitigating the ripple effects of the initial cause before we can re-enable access to the affected data.
Sep 28, 12:47 PDT
Identified - The issue has been identified and a potential fix is being tested and rolled out to the fleet.
Sep 28, 11:47 PDT
Update - We have restored querying, but customers may see some missing data from a subset of queries. We are actively working on restoring the ability to query that data.
Sep 28, 11:05 PDT
Update - We are continuing to investigate, and are actively working to identify a fix.
Sep 28, 10:55 PDT
Update - We are continuing to investigate, and are actively working to identify a fix.
Sep 28, 10:17 PDT
Investigating - We are investigating elevated error rates with queries for some customers.
Sep 28, 09:51 PDT
Sep 29, 2022

No incidents reported.

Sep 28, 2022
Sep 27, 2022

No incidents reported.

Sep 26, 2022

No incidents reported.

Sep 25, 2022

No incidents reported.

Sep 24, 2022

No incidents reported.

Sep 23, 2022

No incidents reported.

Sep 22, 2022

No incidents reported.

Sep 21, 2022

No incidents reported.

Sep 20, 2022

No incidents reported.