On April 9, event ingestion and the Honeycomb UI were unavailable in EU1 from 22:00:10 to 22:11:30 UTC. We’d like to share a bit more about what went wrong and the next steps we plan to take.
We frequently deploy fixes and improvements as part of our regular work on Honeycomb. Our systems run on Kubernetes, and we deploy changes by terminating pods and replacing them with new ones, one at a time. Normally, this is a completely uneventful process: new pods pick up where the old ones left off and continue processing customer requests. In this case, however, traffic was not forwarded to the new pods, even though they were working correctly.
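For readers curious what this looks like in practice, here is a minimal sketch (not our actual tooling) of observing a rolling update with the Kubernetes Python client. The deployment name and namespace are illustrative placeholders.

```python
# Minimal sketch: observe a rolling update with the Kubernetes Python client.
# The deployment name and namespace below are illustrative placeholders, not
# our actual service names.
import time

from kubernetes import client, config


def watch_rollout(name: str = "example-ingest", namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    apps = client.AppsV1Api()

    while True:
        dep = apps.read_namespaced_deployment(name, namespace)
        desired = dep.spec.replicas or 0
        updated = dep.status.updated_replicas or 0
        ready = dep.status.ready_replicas or 0

        print(f"{name}: desired={desired} updated={updated} ready={ready}")

        # During a healthy one-at-a-time rollout, `ready` stays close to
        # `desired` while `updated` climbs; the rollout is done when both match.
        if updated == desired and ready == desired:
            break
        time.sleep(5)


if __name__ == "__main__":
    watch_rollout()
```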
Once the last of the old pods was terminated, service was abruptly interrupted. This happened to both our event ingestion service and the service behind our web-based user interface at roughly the same time.
When a new pod is started, a component called the AWS Load Balancer Controller registers it with the Application Load Balancer (ALB), which then begins forwarding traffic to that pod. During this incident, the AWS Load Balancer Controller had failed, so new pods were not being registered with the ALB. Once that happened, the next deployment caused a service failure.
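As an illustration of the kind of check that would surface this condition, the sketch below uses boto3 to ask an ALB target group whether it has any healthy targets. The target group ARN is a placeholder, and this is a simplified example rather than our production tooling.

```python
# Minimal sketch: verify that an ALB target group still has healthy targets.
# The target group ARN is a placeholder; real tooling would look it up from
# the TargetGroupBinding managed by the AWS Load Balancer Controller.
import boto3


def healthy_target_count(target_group_arn: str) -> int:
    elbv2 = boto3.client("elbv2")
    resp = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return sum(
        1
        for desc in resp["TargetHealthDescriptions"]
        if desc["TargetHealth"]["State"] == "healthy"
    )


if __name__ == "__main__":
    arn = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/example/abc123"
    count = healthy_target_count(arn)
    print(f"healthy targets: {count}")
    if count == 0:
        print("no healthy targets; new pods are likely not being registered")
```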
Earlier on April 9, we began deploying a routine system update to our Kubernetes cluster. Despite extensive testing in our development and staging environments, the system update interacted poorly with other components of our cluster, causing the AWS Load Balancer Controller and other pods to fail sporadically. We restarted the AWS Load Balancer Controller to restore service, and we also rolled back the system update to prevent recurrence.
With service restored, we have turned our attention to understanding the exact cause of the failure, both to prevent recurrence and to let us deploy the system update safely. Now that we know the AWS Load Balancer Controller is such a critical component, we'll add monitoring and alerting so that an on-call engineer is made aware of a failure quickly. This will allow us to preemptively pause deployments and prevent the kind of service interruption we saw on April 9.
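As a rough sketch of the kind of check we have in mind (the names and behavior here are illustrative, not our actual alerting configuration), a monitor could compare the controller Deployment's ready replicas against what's desired and raise an alert when they diverge:

```python
# Minimal sketch: alert when the AWS Load Balancer Controller has fewer ready
# replicas than desired. The namespace, deployment name, and alert action are
# illustrative; real alerting would page the on-call engineer rather than print.
from kubernetes import client, config


def controller_is_healthy(
    name: str = "aws-load-balancer-controller",
    namespace: str = "kube-system",
) -> bool:
    config.load_incluster_config()  # assumes this check runs inside the cluster
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name, namespace)
    desired = dep.spec.replicas or 0
    ready = dep.status.ready_replicas or 0
    return desired > 0 and ready >= desired


if __name__ == "__main__":
    if not controller_is_healthy():
        # A real check would page on-call and pause in-flight deployments.
        print("ALERT: aws-load-balancer-controller is degraded; pause deployments")
```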