An investigation into a gradual SLO burn revealed that one of the redundant instances behind one of our storage partitions suffered a hardware fault. The instance remained online and kept ingesting traffic successfully, but was not adequately responsive to data queries. The hardware fault occurred on 9/11 at 16:50 UTC, but lower traffic volumes hid the issue until platform load increased earlier today, from 12:30 UTC until 18:15 UTC.
Customers whose datasets span this partition and who queried them during that time span may have seen intermittently degraded query times or failed queries.
We believe that roughly 3% of triggers running every minute may have failed between 12:30 and 18:15 UTC. We have no record of triggers with longer intervals being significantly affected.
The problem should now be resolved.
Sep 13, 06:30 PDT