Webhook delays for some SparkPost and SparkPost Enterprise customers (US hosted only)
Incident Report for SparkPost
Postmortem

Summary

On Friday, March 13th, some webhook batches in our US region were delayed by up to 4 hours. In addition, some webhook batches were incorrectly retried even after customers’ webhook receivers had accepted them.

Timeline and Impact

March 13th, all times in UTC

  • 13:45 - Webhook delivery queues begin to back up [IMPACT PERIOD BEGINS]
  • 14:22 - Team is alerted to backups and initial troubleshooting and mitigation steps are taken
  • 17:00 - Material progress is made and the backlog begins to clear
  • 23:00 - Webhook delivery queues are back to normal state and data is no longer delayed [IMPACT PERIOD ENDS]

Root Cause

A surge in message volume, combined with two latent bugs in the webhooks delivery service, caused the data processing system to fall behind. One bug, in data processing, severely reduced overall throughput. The other caused messages to be removed from the queue only after a number of attempts, so batches were sent multiple times even after a webhook consumer had responded with an HTTP 200. The potential for these bugs to cause serious issues was not fully understood, and no impact had been observed recently.
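
To illustrate the intended behavior behind the batch deletion fix, here is a minimal sketch in Python. It is not the actual service code: the `drain` function, the `send` callback, and the `MAX_ATTEMPTS` limit are hypothetical names chosen for this example. The point it shows is that a batch should leave the queue on the first 2xx response rather than after a fixed number of attempts.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit, not the real service setting


def drain(queue, send):
    """Deliver every batch in `queue` using `send(batch) -> HTTP status`.

    A batch is dropped as soon as the receiver answers with a 2xx status.
    It is requeued only on a non-2xx status, and only until MAX_ATTEMPTS.
    Returns a log of (batch, status) pairs, one per delivery attempt.
    """
    log = []
    attempts = {}
    while queue:
        batch = queue.popleft()
        attempts[batch] = attempts.get(batch, 0) + 1
        status = send(batch)
        log.append((batch, status))
        if 200 <= status < 300:
            continue  # accepted: the batch is already off the queue, never resent
        if attempts[batch] < MAX_ATTEMPTS:
            queue.append(batch)  # failed: retry later
    return log
```

With a receiver that always returns 200, each batch appears in the log exactly once. The behavior described above corresponds to requeueing regardless of the status code until the attempt limit is reached.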

Corrective Actions

  • Fix stream processing & batch deletion bugs - DONE
  • Add APM profiling to confirm fixes and troubleshoot future issues - DONE
  • Restore settings to their pre-incident values - IN PROGRESS
  • Add a separate delay queue for webhooks so that failed batches will not back up new messages under extreme load & perform additional load testing of the production system (Target completion: June 1st)
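
As a rough illustration of the delay queue idea in the last item, the following Python sketch keeps failed batches in a separate time-ordered queue so they do not sit in front of new messages. It is a sketch under assumptions, not the planned design: the `WebhookQueues` class, its method names, the 30-second base delay, and the exponential backoff are all hypothetical.

```python
import heapq
from collections import deque


class WebhookQueues:
    """Main queue for new batches plus a separate delay queue for retries."""

    def __init__(self, base_delay=30.0):
        self.main = deque()   # new batches, first in first out
        self.delayed = []     # min-heap of (ready_at, seq, batch)
        self.base_delay = base_delay  # hypothetical value, in seconds
        self._seq = 0         # tie-breaker so batches are never compared

    def enqueue(self, batch):
        self.main.append(batch)

    def defer(self, batch, attempt, now):
        """Park a failed batch until its backoff period has elapsed."""
        ready_at = now + self.base_delay * (2 ** (attempt - 1))
        heapq.heappush(self.delayed, (ready_at, self._seq, batch))
        self._seq += 1

    def next_batch(self, now):
        """Return a retry that is due, else the next new batch, else None."""
        if self.delayed and self.delayed[0][0] <= now:
            return heapq.heappop(self.delayed)[2]
        if self.main:
            return self.main.popleft()
        return None
```

In this sketch a failed batch only competes with new traffic once its backoff has expired. Whether due retries or new batches should go first is a design choice that this example does not settle.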
Posted Mar 19, 2020 - 19:20 EDT

Resolved
All data is up to date. This issue is resolved.
Posted Mar 13, 2020 - 19:22 EDT
Monitoring
We have implemented a fix and data should be caught up. We continue to monitor the situation.
Posted Mar 13, 2020 - 19:11 EDT
Update
Most data backlogs are resolved. We will update once everything is 100% cleared out.
Posted Mar 13, 2020 - 18:25 EDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 13, 2020 - 18:23 EDT
Update
Some webhook data is still delayed for some customers. We are continuing to work on this issue as a top priority. We apologize for the inconvenience.
Posted Mar 13, 2020 - 16:17 EDT
Identified
We believe we have identified the issue. We have applied a change and are continuing to monitor.
Posted Mar 13, 2020 - 14:24 EDT
Investigating
Our webhook data delivery services are running behind and some customers may see a delay in data streamed to their webhook endpoints.
Message injection and outbound message delivery are not impacted - all messages are flowing as expected.
Note: This issue does not impact our SparkPost customers hosted in the EU.
Posted Mar 13, 2020 - 14:06 EDT
This incident affected: Event Webhook Delivery Service (Event Webhooks - USA).