On August 30, Honeycomb experienced a period of about one hour during which 20% to 50% of queries in the US region failed, across all query types. During this time, queries created with the query builder, boards, trace views, and some trigger evaluations may have failed and returned an error instead of results.
The incident happened during the deployment of a routine upgrade of one of our gRPC libraries, which is used in a significant portion of our stack. As the upgrade rolled out to pre-production and non-public environments, a few minor transient errors were detected, but they could not be reproduced in the hour or more that followed the deployment. When we deployed it to production environments, the US Honeycomb instance started erroring out when communicating with our querying system, and this time it kept failing even after the deployment had completed.
We ended up rolling back the change because the problem was not transient and grew worse as more hosts were involved; it was triggered more consistently under the heavier load of our production systems.
A later investigation revealed that the issue stemmed from a small change in the library which, although gated behind a compile flag, still touched code elsewhere, such that an optimization clashed with our usage of the underlying protobuf library. The protobuf library is used for serialization of data, and while gRPC itself remained functional, our usage of protobuf as part of our querying logic was impacted.
Once the issue was understood, the code was modified to be safer with regard to the new library version, and the upgrade was rolled out without further issue. An internal incident review was conducted, and we do not plan on further external reports at this time.