All Systems Operational
api.honeycomb.io - Event Ingest ? Operational
90 days ago
99.87 % uptime
Today
ui.honeycomb.io - Querying ? Operational
90 days ago
99.88 % uptime
Today
ui.honeycomb.io - App Interface ? Operational
www.honeycomb.io ? Operational
Operational
Degraded Performance
Partial Outage
Major Outage
Maintenance
Major outage
Partial outage
No downtime recorded on this day.
No data exists for this day.
had a major outage.
had a partial outage.
Past Incidents
Aug 6, 2020

No incidents reported today.

Aug 5, 2020
Resolved - The rollback is complete, and we believe this incident has now been resolved.
Aug 5, 14:24 PDT
Monitoring - We have identified the commit that introduced the bug and have initiated a rollback. Some queries may return partial results until the rollback is complete.
Aug 5, 14:18 PDT
Identified - We believe at this time that only data querying and triggers are impacted; data sent to Honeycomb during this incident has been ingested and will be available for analysis once querying is restored.
Aug 5, 14:09 PDT
Investigating - We are currently investigating an issue impacting querying data via ui.honeycomb.io.
Aug 5, 14:03 PDT
Aug 4, 2020
Resolved - This incident has been resolved. Querying functionality should be working normally.
Aug 4, 15:01 PDT
Monitoring - We have deployed a fix and are monitoring to confirm it is working as expected.
Aug 4, 11:01 PDT
Identified - We have identified the issue and are working on deploying a fix.
Aug 4, 10:31 PDT
Investigating - We are investigating reports of elevated numbers of queries failing.
Aug 4, 10:07 PDT
Aug 3, 2020
Resolved - We experienced elevated numbers of queries failing between for 1.5 hours due to a bad deploy. This has been rolled back and query errors have resolved. There was no data loss during this time
Aug 3, 15:50 PDT
Aug 2, 2020

No incidents reported.

Aug 1, 2020

No incidents reported.

Jul 31, 2020

No incidents reported.

Jul 30, 2020

No incidents reported.

Jul 29, 2020

No incidents reported.

Jul 28, 2020
Resolved - On July 28 starting at approximately 01:30am PDT, Honeycomb began experiencing increased error rates and latency across our services, leading to a partial outage ultimately lasting approximately two hours and impacting 0.002% of ingest traffic over that period. The outage was triggered by AWS Route53 suffering lookup failures from EC2 instances in the us-east-1a availability zone (AZ), which hosts many of our instances across all our services.

Our Ingest Latency SLO burn alert triggered at 01:53am PDT. On-call investigated and used Honeycomb (BubbleUp specifically) to determine that our ingest instances in us-east-1a were to blame. On-call responded by removing that AZ from the autoscaling group, which reacted by replacing the problematic hosts with new ones in unaffected AZs. This migration finished at 03:15am PDT. There was a small amount of ingested event loss due to the outage (roughly 35000 event payloads, 0.002% of our traffic over that period) because retries for sending the events to Kafka were exhausted on the affected ingest hosts.

While ingest is the most important service to fix quickly, this outage also affected other services: We saw elevated response times for our UI and storage service. Some triggers and burn alerts also failed to evaluate during that time window. If you believe you have been impacted by a trigger or alert failing to run during this period, please reach out to our support team at honeycomb.io/support.

Throughout this incident, while degraded, Honeycomb remained usable and we used it to debug the incident.

AWS declared the outage resolved at 04:37am PDT, but we delayed transitioning ingest hosts back to us-east-1a until 09:45am PDT.
Jul 28, 01:30 PDT
Jul 27, 2020

No incidents reported.

Jul 26, 2020

No incidents reported.

Jul 25, 2020

No incidents reported.

Jul 24, 2020

No incidents reported.

Jul 23, 2020

No incidents reported.