Outbound message delivery is delayed/slow for SparkPost and SparkPost Enterprise customers (US hosted only)

Incident Report for SparkPost

Postmortem

During this incident, outbound message delivery was delayed for all customers provisioned in the US region.

Impact Period: 16:40 Oct 28 - 01:00 Oct 29 UTC

Impact to Customers: No messages were delivered between 16:40 and 19:15 UTC on Oct 28. From 19:15 UTC until as late as 01:00 UTC on Oct 29, delivery attempts were delayed while the backlog of queued messages cleared for all customers. Message injection (both the REST Transmissions API and the SMTP API) was not impacted and remained fully operational throughout the incident.
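For readers who want the length of each phase, the arithmetic on the times above can be sketched as follows. The variable names are illustrative only, and 01:00 UTC is the stated upper bound for backlog clearance, so the degraded window is a worst case rather than what every customer experienced.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Times taken from the impact statement above (all UTC).
outage_start = datetime(2021, 10, 28, 16, 40, tzinfo=UTC)
delivery_resumed = datetime(2021, 10, 28, 19, 15, tzinfo=UTC)
backlog_cleared = datetime(2021, 10, 29, 1, 0, tzinfo=UTC)  # upper bound

full_outage = delivery_resumed - outage_start   # no deliveries at all
degraded = backlog_cleared - delivery_resumed   # delayed delivery attempts
total = backlog_cleared - outage_start          # whole impact period

print("No delivery:      ", full_outage)  # 2:35:00
print("Delayed delivery: ", degraded)     # 5:45:00
print("Total impact:     ", total)        # 8:20:00
```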

This incident was precipitated by a networking issue at our cloud service provider: the network that routes messages from SparkPost to the internet failed. After our cloud service provider fixed the issue, however, we did not recover as expected because of a bottleneck in our architecture. It took several hours to process the queued messages, which lengthened the time to first delivery attempt. Our corrective actions include: (1) reviewing our architecture both internally and with our providers, (2) building resiliency against the type of network failure that precipitated this event, and (3) shortening our time to recover from this type of failure to reduce the impact on our customers.

Posted Nov 03, 2021 - 14:28 EDT

Resolved

This incident has been resolved.

Posted Oct 28, 2021 - 21:59 EDT

Monitoring

For the majority of our customers, the backlog of messages has cleared. For a few customers, the queues will be cleared within the next 60 minutes.

Posted Oct 28, 2021 - 21:38 EDT

Update

We continue to work through the backlog of messages. (During this time, you may see delays in the event data streamed via webhooks.)

We will continue to provide regular updates until the backlog of queued messages has cleared.

Posted Oct 28, 2021 - 20:21 EDT

Update

Posted Oct 28, 2021 - 19:19 EDT

Update

Posted Oct 28, 2021 - 18:17 EDT

Update

We are working on accelerating outbound message delivery so that we can work through the backlog more quickly. (During this time, you may see delays in the event data streamed via webhooks.) We will publish a postmortem for this incident at a later date.

We will continue to provide regular updates until the backlog of queued messages has cleared.

Posted Oct 28, 2021 - 17:22 EDT

Identified

We have identified the issue. Outbound message delivery has resumed, but it will take time for all queued messages to be delivered. (During this time, you may see delays in the event data streamed via webhooks.) We will publish a postmortem for this incident at a later date.

We will continue to provide regular updates until the backlog of queued messages has cleared.

Posted Oct 28, 2021 - 16:41 EDT

Update

We are making progress on identifying the root cause and fixing the issue. Outbound message delivery has resumed but is not yet fully operational.

Injection (REST and SMTP) is fully operational and messages will be queued (not lost) and will continue to be attempted.

We will continue to provide regular updates every 30 minutes.

Posted Oct 28, 2021 - 16:14 EDT

Update

We are continuing to investigate this issue and are making progress on identifying the root cause. Outbound message delivery is stopped for all customers provisioned in the US region - messages are temporarily failing with "451 4.4.1 [internal] No valid hosts".
Injection (REST and SMTP) is fully operational and messages will be queued (not lost) and will continue to be attempted.
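
As background on why these failures are temporary: in SMTP, any reply code in the 4xx range is a transient failure, so the sender keeps the message queued and retries later, whereas a 5xx code is a permanent failure. The sketch below shows one way a log consumer might classify such a response line. It is an illustration only, not SparkPost's implementation; `SmtpReply`, `parse_reply`, and `is_transient` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class SmtpReply(NamedTuple):
    code: int                 # three-digit SMTP reply code, e.g. 451
    enhanced: Optional[str]   # enhanced status code, e.g. "4.4.1", if present
    text: str                 # remaining human-readable text


# Reply code, optional enhanced status code, then free text.
_REPLY_RE = re.compile(r"^(\d{3})(?:[ -](\d\.\d{1,3}\.\d{1,3}))?\s*(.*)$")


def parse_reply(line: str) -> SmtpReply:
    """Split a single SMTP response line into its parts (hypothetical helper)."""
    match = _REPLY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an SMTP reply line: {line!r}")
    return SmtpReply(int(match.group(1)), match.group(2), match.group(3))


def is_transient(reply: SmtpReply) -> bool:
    """4xx replies are transient: the message stays queued and is retried."""
    return 400 <= reply.code < 500


reply = parse_reply("451 4.4.1 [internal] No valid hosts")
print(reply.code, reply.enhanced, is_transient(reply))  # 451 4.4.1 True
```

Because the code reported during this incident is in the 4xx range, messages stayed in the queue rather than being bounced, which is consistent with the statement that messages were queued and not lost.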

We will continue to provide regular updates every 30 minutes.

Posted Oct 28, 2021 - 15:29 EDT

Update

We are continuing to investigate this issue. Outbound message delivery is stopped for all customers provisioned in the US region - messages are temporarily failing with "451 4.4.1 [internal] No valid hosts".
Injection (REST and SMTP) is fully operational and messages will be queued (not lost) and will continue to be attempted.

We will continue to provide regular updates every 30 minutes.

Posted Oct 28, 2021 - 14:49 EDT

Update

We are continuing to investigate this issue - all hands are on deck. Outbound message delivery is stopped for all customers provisioned in the US region - messages are temporarily failing with "451 4.4.1 [internal] No valid hosts".
Injection (REST and SMTP) is fully operational and messages will be queued (not lost) and will continue to be attempted.

Posted Oct 28, 2021 - 14:03 EDT

Update

Posted Oct 28, 2021 - 13:32 EDT

Investigating

We are investigating an increase in delivery latency for some outbound messages. (NOTE: This does not impact our customers hosted in the EU.)

Posted Oct 28, 2021 - 12:56 EDT

This incident affected: SMTP Delivery (Outbound Message Delivery) (SMTP Delivery - USA).