First, we want to apologize for any inconvenience or problems these issues caused you. We’d also like to give you preliminary information on the May 24th outage. While we don’t yet fully understand the root cause, we want to share what we know now.
From approximately 11 AM until 4 PM US/ET, we had extensive delivery problems and service availability issues due to the cascading effect of DNS query failures. During this period, our customers observed:
Regular public DNS queries (from our own DNS infrastructure) were not being answered at a reasonable rate through the AWS network. We are awaiting further information from AWS to explain why this happened, and we will share the full root cause when we publish the finalized RCA.
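For readers who want to gauge DNS resolution health from their own environment, here is a minimal sketch that times lookups through the system resolver and reports the answer rate. The hostnames, sample count, and output format are illustrative placeholders, not SparkPost values or tooling.

```python
#!/usr/bin/env python3
"""Rough probe of DNS resolution health via the system resolver.

Minimal sketch: the hostnames and sample count below are placeholders.
"""
import socket
import statistics
import time

PROBE_HOSTS = ["example.com", "example.org", "example.net"]  # placeholders
SAMPLES_PER_HOST = 5

def probe(hostname):
    """Time a single lookup; return latency in seconds, or None on failure."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return None
    return time.monotonic() - start

def main():
    latencies, failures = [], 0
    for host in PROBE_HOSTS:
        for _ in range(SAMPLES_PER_HOST):
            result = probe(host)
            if result is None:
                failures += 1
            else:
                latencies.append(result)
    total = len(PROBE_HOSTS) * SAMPLES_PER_HOST
    print(f"answered: {len(latencies)}/{total}")
    if latencies:
        print(f"median latency: {statistics.median(latencies) * 1000:.1f} ms")

if __name__ == "__main__":
    main()
```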
Initially, we attempted to address query performance by increasing DNS server capacity fivefold, but this did not improve performance as we expected. Subsequently, we repointed DNS services for the vast majority of our customers to local AWS nameservers, which had sufficient capacity. (For a small number of customers, DNS services continue to be provided by our own DNS infrastructure.) With these measures in place, service was fully restored for all customers by 6 PM US/ET. We plan to move back to leveraging our own DNS infrastructure (AWS’ recommended model) once we fully understand this outage and have made the necessary changes.
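As an illustration of the kind of repointing described above, the sketch below pins a resolver to an explicit set of nameservers and compares answers, for example the Amazon-provided VPC resolver reachable at the link-local address 169.254.169.253 from inside a VPC. It assumes the third-party dnspython package; the 198.51.100.10 address is a documentation placeholder standing in for one of our own nameservers, not a real SparkPost endpoint.

```python
"""Sketch of querying through explicitly chosen nameservers (assumes dnspython)."""
import dns.exception
import dns.resolver

def make_resolver(nameservers, timeout=2.0):
    """Build a resolver pinned to the given nameserver IPs with short timeouts."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = list(nameservers)
    resolver.timeout = timeout       # per-attempt timeout
    resolver.lifetime = timeout * 2  # total time allowed for the query
    return resolver

# Placeholder IP standing in for a nameserver in our own DNS fleet.
own_fleet = make_resolver(["198.51.100.10"])

# The Amazon-provided VPC resolver (only reachable from inside a VPC).
aws_local = make_resolver(["169.254.169.253"])

for label, resolver in [("own fleet", own_fleet), ("AWS local", aws_local)]:
    try:
        answer = resolver.resolve("example.com", "MX")
        print(label, [str(r.exchange) for r in answer])
    except dns.exception.DNSException as exc:
        print(label, "query failed:", exc)
```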
We are working closely with AWS to better understand what happened and to take further corrective actions. In addition, our Engineering teams are exploring various ways to isolate SparkPost and our customers from upstream and bandwidth-related DNS failures, including:
These follow-up actions are our top priority to ensure we do not have a repeat incident.
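As a hypothetical illustration of this kind of isolation (not a description of the changes we will actually make), the sketch below falls back to a second resolver group when the first fails to answer within a short timeout. It again assumes dnspython, and all nameserver IPs are placeholders.

```python
"""Illustrative client-side resolver failover sketch (assumes dnspython)."""
import dns.exception
import dns.resolver

# Ordered resolver groups to try: primary fleet first, then a fallback path.
RESOLVER_GROUPS = [
    ["198.51.100.10", "198.51.100.11"],  # placeholder: primary DNS fleet
    ["169.254.169.253"],                 # placeholder: local VPC resolver fallback
]

def resolve_with_failover(qname, rdtype="MX", per_group_timeout=2.0):
    """Try each resolver group in order; return the first successful answer."""
    last_error = None
    for group in RESOLVER_GROUPS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = list(group)
        resolver.timeout = per_group_timeout
        resolver.lifetime = per_group_timeout
        try:
            return resolver.resolve(qname, rdtype)
        except dns.exception.DNSException as exc:
            last_error = exc  # fall through to the next group
    raise last_error

if __name__ == "__main__":
    answer = resolve_with_failover("example.com")
    print([str(r) for r in answer])
```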