Query and trigger degradation

Incident Report for Honeycomb

Resolved

An investigation into a gradual burn of our SLOs revealed that one of the redundant instances behind one of our storage partitions suffered a hardware fault. The instance remained online and kept ingesting traffic successfully, but was not adequately responsive to data queries. The hardware fault occurred on Sep 11 at 16:50 UTC, but lower traffic masked the issue until platform load increased earlier today, from 12:30 UTC until 18:15 UTC.

Customers whose datasets span this partition and who queried them during that window may have intermittently seen degraded query times or failed results.

We believe that roughly 3% of triggers running every minute may have failed between 12:30 and 18:15 UTC. We have no record of triggers running at longer intervals being significantly affected.

The problem should now be resolved.

Posted Sep 13, 2021 - 06:30 PDT