Summary
On Friday, March 13th, some webhook batches in our US region were delayed by up to 4 hours. In addition, some batches were incorrectly retried even after customers’ webhook receivers had accepted them.
Timeline and Impact
March 13th, all times in UTC
Root Cause
A surge in message volume, combined with two latent bugs in the webhooks delivery service, caused the data processing system to fall behind. The first bug, in data processing, severely reduced overall throughput. The second bug caused messages to be removed from the queue only after a number of delivery attempts, so batches were sent multiple times even after a webhook consumer had responded with an HTTP 200. The potential for these bugs to cause serious issues was not fully understood, and no impact had been observed recently.
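To illustrate the intended behavior for the second bug, here is a minimal sketch of a delivery loop that stops retrying a batch as soon as the receiver accepts it. It is not the actual service code: the names `deliver_all`, `send`, and `MAX_ATTEMPTS`, the in-memory queue, and the choice to treat any 2xx status as acceptance are assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical retry limit; the real value is not stated in this report.
MAX_ATTEMPTS = 3


def deliver_all(queue: Deque[str], send: Callable[[str], int]) -> List[str]:
    """Drain the queue and return the batches that were never accepted.

    `send` is a stand-in for the HTTP call to a customer's webhook
    receiver and returns the response status code.
    """
    failed: List[str] = []
    while queue:
        batch = queue.popleft()
        for _attempt in range(MAX_ATTEMPTS):
            status = send(batch)
            if 200 <= status < 300:
                # Accepted: stop immediately so the batch is not re-sent.
                break
        else:
            # Every attempt failed; hand the batch back for follow-up.
            failed.append(batch)
    return failed
```

The key point is the early `break`: a batch leaves the retry loop on the first accepted response, not after a fixed number of attempts. A production system would also need backoff between attempts and durable storage for failed batches, both omitted here.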
Corrective Actions