On Sunday, November 5, we experienced a bit over 1 hour and 10 minutes of partially degraded ingestion, along with roughly 5 minutes of complete ingestion outage. Starting at around 02:15 UTC, customers may have seen event processing from our API ingestion endpoint slow down and fail intermittently, until it stopped entirely for a few minutes. At 02:50 UTC, the system briefly recovered, although it took until 03:30 UTC for it to become fully stable again.
We detected the issue through our standard alerting mechanisms, which notified our on-call engineers of problems with both ingestion performance and stability. Additionally, automated load-shedding mechanisms aimed at maintaining system stability tripped and generated extra notifications.
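As background, load-shedding here means rejecting incoming work once queues grow past a safety threshold, rather than letting hosts fall over under the backlog. A minimal sketch of the idea; the structure and threshold are illustrative, not our production configuration:

```python
import collections

class LoadShedder:
    """Drop incoming work when the queue is too deep, sacrificing some
    requests to keep the host itself healthy. Illustrative sketch only."""

    def __init__(self, max_depth=1000):
        self.max_depth = max_depth
        self.queue = collections.deque()
        self.shed = 0  # count of rejected items, useful for alerting

    def submit(self, item):
        if len(self.queue) >= self.max_depth:
            self.shed += 1      # reject instead of queueing further
            return False
        self.queue.append(item)
        return True
```

In a real system the shed counter is exactly the kind of signal that, when it spikes, pages the on-call engineer alongside the latency alerts.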
Although load-shedding was in place, with the objective of dropping traffic aggressively enough to prevent a cascade of ingest host failures, our ingestion fleet entered a series of restarts. Our engineers tried to manually and aggressively scale the fleet up to buy it more capacity. We then noticed an interplay with Kubernetes' crash loop back-off behavior, which kept previously failing hosts offline for increasingly long periods, so our overall cluster capacity still wasn't sufficient.
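Kubernetes restarts a crashing container with an exponentially growing delay (a 10-second base, doubling up to a 5-minute cap by default), which is why a fleet stuck in CrashLoopBackOff can stay mostly offline even while you are trying to scale it up. A small sketch of those delays; the constants match the documented defaults, but verify them against your Kubernetes version:

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Approximate the CrashLoopBackOff delays (in seconds) Kubernetes
    applies before restarting a repeatedly failing container: the delay
    doubles from a 10s base up to a 5-minute cap."""
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

# After six restarts the pod is already waiting the full 5 minutes
# between attempts: [10, 20, 40, 80, 160, 300, 300]
print(crashloop_delays(7))
```

In other words, every additional crash made the affected hosts come back more slowly, compounding the capacity shortfall.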
We also saw aggressive retry behavior from some traffic sources resembling a thundering herd, so we cut off most ingestion for a few minutes to let us rebuild the capacity in ingest hosts required to deal with all the incoming data. We then quickly recovered, and tuned rate-limiting for our most aggressive sources to stabilize the cluster.
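On the client side, a thundering herd like this is usually defused with jittered exponential back-off, so that retries spread out over time instead of arriving in synchronized waves. A sketch of the common "full jitter" strategy; the base and cap values are illustrative, not our clients' actual settings:

```python
import random

def backoff_with_jitter(attempt, base=0.5, cap=30.0):
    """'Full jitter' retry delay: pick uniformly from
    [0, min(cap, base * 2^attempt)] so that clients retrying after the
    same failure desynchronize instead of stampeding the server at once."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

The key property is the randomness: even if thousands of clients fail at the same instant, their next attempts land scattered across the whole back-off window.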
After analysis, we've gathered a few clues indicating that this is a variation on previously seen incidents for which we generally had adequate mitigation mechanisms in place, but that this time unfolded in a manner that circumvented some of them.
Specifically, we found an abnormally large number of queries coming from many distinct connections within a few minutes. Before auto-scaling could kick in (which may have been slowed down by recent optimizations to our ingest code that shifted its workload), these requests triggered a large number of database writes that trampled each other and bogged down our connection pools. This happened faster than our cache, which would have avoided that work, could propagate the writes to all hosts. This, in turn, amplified and accelerated the memory use of our ingestion hosts until they died.
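One common way to keep a burst of identical cache misses from trampling each other with duplicate writes is request coalescing (often called "single-flight"): the first caller does the work, and concurrent callers wait for and share its result. This is a simplified sketch of the pattern, not our actual ingestion code:

```python
import threading

class SingleFlight:
    """Coalesce concurrent identical calls: the first caller (the leader)
    performs the work; callers that arrive while it is in flight wait for
    and share the leader's result instead of repeating the work."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Event carrying a .result attribute

    def do(self, key, fn):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                event = threading.Event()
                event.result = None
                self._inflight[key] = event
                leader = True
            else:
                leader = False

        if not leader:
            event.wait()          # share the leader's result
            return event.result

        try:
            event.result = fn()   # only the leader touches the backend
            return event.result
        finally:
            with self._lock:
                del self._inflight[key]
            event.set()           # wake any waiters
```

With coalescing in place, N simultaneous misses for the same key cost one backend write instead of N writes competing for the same connection pool.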
By the time the hosts all came back, the writes had made it through and the caches had automatically refreshed themselves, but the system was not yet stable. We identified some of our customers sending us more than 20x their usual traffic. We initially thought this was a backfill, but then started suspecting a surprising retry behavior. Because the OTel protocol can only return a success or a failure for an entire batch of spans, while we might drop only some of them, we suspected these clients were re-sending entire batches to cover a small portion of failures. We temporarily tightened rate limits on their traffic until it subsided, and the ingestion pipeline as a whole became stable again.
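The arithmetic of whole-batch retries shows how quickly this amplifies: dropping even a handful of spans puts the entire batch back on the wire. A quick illustration with made-up numbers, not measured customer traffic:

```python
def whole_batch_resend(batch_size, dropped_per_attempt, attempts):
    """Spans transmitted when any drop forces a full-batch resend,
    versus a hypothetical protocol that re-sends only the dropped spans."""
    whole_batch = batch_size * attempts
    per_span = batch_size + dropped_per_attempt * (attempts - 1)
    return whole_batch, per_span

# With 1000-span batches, 10 spans dropped per attempt, and 4 attempts:
# whole-batch retries transmit 4000 spans, while per-span retries would
# transmit only 1030 to deliver the same data.
print(whole_batch_resend(1000, 10, 4))
```

Under partial failure, a client that can only retry whole batches multiplies its traffic by the number of attempts, which matches the 20x spikes we observed.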
Because the failure mode is relatively well understood, our next step is to identify the projects that will address these failure paths. In the meantime, we have scaled up the fleet to prevent a repeat of this incident while we work on the longer-term fixes.