By adjusting our service startup procedure to better ensure the process is ready to handle queries before it enters service, the query delays surrounding deploys appear to have been reduced to below baseline noise. Though the excessive delays were a recent symptom, minor query delays (small enough to remain within our SLO budget) had been showing up during every deploy for a long time. The recent failures pushed us to take action and resolve this long-standing SLO burn.
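The startup adjustment described above amounts to readiness gating: a health endpoint reports "not ready" until startup work finishes, so the load balancer keeps the instance out of rotation. A minimal sketch of the idea follows; the names (`health_status`, `warm_up`) and the specific startup tasks are illustrative assumptions, not Honeycomb's actual implementation.

```python
import threading

# Flag flipped once startup work completes. Until then, the health
# endpoint reports 503 so the load balancer keeps this instance
# out of rotation and no queries are routed to it.
ready = threading.Event()

def health_status() -> int:
    """Return the HTTP status the readiness endpoint should serve.

    503 = still starting up (do not route traffic here yet);
    200 = ready to handle queries.
    """
    return 200 if ready.is_set() else 503

def warm_up() -> None:
    # Hypothetical startup work: warm caches, open connections,
    # load configuration. Only after it finishes do we mark ready.
    ready.set()
```

A deploy then becomes safe by construction: new processes absorb traffic only after `warm_up` completes, rather than racing the first incoming queries.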
Posted May 11, 2023 - 11:23 PDT
While we don’t have a definitive answer explaining this incident yet, we have multiple threads to pull on. Additional changes are queued up to improve the capacity of the system. Since the system is stable, we are pausing status updates for the time being.
We are leaving this incident open and will provide further updates tomorrow as we resume operational work on this issue.
Posted May 09, 2023 - 14:01 PDT
We have recently been actively re-balancing the capacity of our query engine cluster. This re-balancing degraded performance during our deployment windows, which translated into slower queries (regular queries, as well as those from boards and through the API), a small number of partially delayed SLO notifications, and a small number of missed triggers (~3%) today. We have stabilized the cluster for now and are actively seeking ways of managing further capacity changes with less customer-facing impact.
Posted May 09, 2023 - 11:41 PDT
This incident affected: ui.honeycomb.io - US1 Querying and ui.honeycomb.io - US1 Trigger & SLO Alerting.