First, we want to apologize for any inconvenience or problems these issues caused you. We’d also like to give you preliminary information on the May 24th outage. While we don’t yet fully understand the root cause, we want to share what we know now.
From approximately 11 AM until 4 PM US/ET, we had extensive delivery problems and service availability issues due to the cascading effect of DNS query failures. During this period, our customers observed:
Regular public DNS queries (from our own DNS infrastructure) were not being answered at a reasonable rate through the AWS network. We are awaiting further information from AWS to explain why this happened, and we will share the full root cause when we publish the finalized RCA.
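For readers who want to gauge DNS resolution health from their own environment, here is a minimal sketch that times lookups through the system resolver and reports the answer rate. The hostnames, sample count, and output format are illustrative placeholders, not SparkPost values or tooling.

```python
#!/usr/bin/env python3
"""Rough probe of DNS resolution health via the system resolver.

Minimal sketch: the hostnames and sample count below are placeholders.
"""
import socket
import statistics
import time

PROBE_HOSTS = ["example.com", "example.org", "example.net"]  # placeholders
SAMPLES_PER_HOST = 5

def probe(hostname):
    """Time a single lookup; return latency in seconds, or None on failure."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return None
    return time.monotonic() - start

def main():
    latencies, failures = [], 0
    for host in PROBE_HOSTS:
        for _ in range(SAMPLES_PER_HOST):
            result = probe(host)
            if result is None:
                failures += 1
            else:
                latencies.append(result)
    total = len(PROBE_HOSTS) * SAMPLES_PER_HOST
    print(f"answered: {len(latencies)}/{total}")
    if latencies:
        print(f"median latency: {statistics.median(latencies) * 1000:.1f} ms")

if __name__ == "__main__":
    main()
```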
Initially, we attempted to address query performance by increasing DNS server capacity fivefold, but this did not improve performance as we expected. Subsequently, we repointed DNS services for the vast majority of our customers to local AWS nameservers, which had sufficient capacity. (For a small number of customers, DNS services continue to be provided by our own DNS infrastructure.) With these measures in place, service was fully restored for all customers by 6 PM US/ET. We plan to move back to leveraging our own DNS infrastructure (AWS’ recommended model) once we fully understand this outage and have made the necessary changes.
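As an illustration of the kind of repointing described above, the sketch below pins a resolver to an explicit set of nameservers and compares answers, for example the Amazon-provided VPC resolver reachable at the link-local address 169.254.169.253 from inside a VPC. It assumes the third-party dnspython package; the 198.51.100.10 address is a documentation placeholder standing in for one of our own nameservers, not a real SparkPost endpoint.

```python
"""Sketch of querying through explicitly chosen nameservers (assumes dnspython)."""
import dns.exception
import dns.resolver

def make_resolver(nameservers, timeout=2.0):
    """Build a resolver pinned to the given nameserver IPs with short timeouts."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = list(nameservers)
    resolver.timeout = timeout       # per-attempt timeout
    resolver.lifetime = timeout * 2  # total time allowed for the query
    return resolver

# Placeholder IP standing in for a nameserver in our own DNS fleet.
own_fleet = make_resolver(["198.51.100.10"])

# The Amazon-provided VPC resolver (only reachable from inside a VPC).
aws_local = make_resolver(["169.254.169.253"])

for label, resolver in [("own fleet", own_fleet), ("AWS local", aws_local)]:
    try:
        answer = resolver.resolve("example.com", "MX")
        print(label, [str(r.exchange) for r in answer])
    except dns.exception.DNSException as exc:
        print(label, "query failed:", exc)
```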
We are working closely with AWS to better understand what happened and to take further corrective actions. In addition, our Engineering teams are exploring various ways to isolate SparkPost and our customers from upstream and bandwidth-related DNS failures, including:
These follow-up actions are our top priority to ensure we do not have a repeat incident.
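As a hypothetical illustration of this kind of isolation (not a description of the changes we will actually make), the sketch below falls back to a second resolver group when the first fails to answer within a short timeout. It again assumes dnspython, and all nameserver IPs are placeholders.

```python
"""Illustrative client-side resolver failover sketch (assumes dnspython)."""
import dns.exception
import dns.resolver

# Ordered resolver groups to try: primary fleet first, then a fallback path.
RESOLVER_GROUPS = [
    ["198.51.100.10", "198.51.100.11"],  # placeholder: primary DNS fleet
    ["169.254.169.253"],                 # placeholder: local VPC resolver fallback
]

def resolve_with_failover(qname, rdtype="MX", per_group_timeout=2.0):
    """Try each resolver group in order; return the first successful answer."""
    last_error = None
    for group in RESOLVER_GROUPS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = list(group)
        resolver.timeout = per_group_timeout
        resolver.lifetime = per_group_timeout
        try:
            return resolver.resolve(qname, rdtype)
        except dns.exception.DNSException as exc:
            last_error = exc  # fall through to the next group
    raise last_error

if __name__ == "__main__":
    answer = resolve_with_failover("example.com")
    print([str(r) for r in answer])
```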