Summary:
On May 7 2019, during routine operational maintenance on SparkPost EU, the central API service tier experienced an outage. This prevented many APIs from working for 15 minutes: Transmissions, Tracking Domains, Sending Domains, Templates, Inbound Domains and Relay Webhook APIs. Additionally, the engagement tracking service failed to record opens and clicks and click-through redirects didn’t work -- meaning that links did not work for recipients who clicked a link in an email during this incident. The Status Page incident did not explicitly state that engagement tracking was impacted.
Impact Period: Tuesday 7 May 2019 15:40 - 15:55 UTC
Root Cause:
A key software component that handles incoming HTTP traffic for APIs and engagement tracking failed to restart because of an invalid configuration. This caused the APIs to return 502 errors to any HTTP requests, including link redirects. Additionally, our alerting framework made it difficult to identify all the impacted components - especially engagement tracking.
Corrective Actions:
At SparkPost we take these kinds of incidents very seriously. We know that our email service is critical for our customers and we know that it’s imperative that we give full, accurate and timely information regarding operational incidents. To that end, we have identified a number of corrective actions to address the gaps that surfaced in this incident:
** Automate the relevant deployment process to ensure that a misconfigured node cannot be put into production
** Isolate engagement tracking inhouse alerting services from API alerting services
** Create a separate Status Page component for engagement tracking