Elevated 5xx API error rates and increased latency for some SparkPost and SparkPost Enterprise customers (EU hosted only)
Incident Report for SparkPost
Postmortem

Summary:

On May 7, 2019, during routine operational maintenance on SparkPost EU, the central API service tier experienced an outage. This prevented many APIs from working for 15 minutes: Transmissions, Tracking Domains, Sending Domains, Templates, Inbound Domains and Relay Webhook APIs. Additionally, the engagement tracking service failed to record opens and clicks, and click-through redirects failed, so recipients who clicked a link in an email during this incident could not reach its destination. The Status Page incident did not explicitly state that engagement tracking was impacted.

Impact Period: Tuesday 7 May 2019 15:40 - 15:55 UTC

Root Cause:

A key software component that handles incoming HTTP traffic for APIs and engagement tracking failed to restart because of an invalid configuration. As a result, all HTTP requests, including link redirects, received 502 errors. Additionally, our alerting framework made it difficult to identify all the impacted components, especially engagement tracking.

Corrective Actions:

At SparkPost we take these kinds of incidents very seriously. We know that our email service is critical for our customers and we know that it’s imperative that we give full, accurate and timely information regarding operational incidents. To that end, we have identified a number of corrective actions to address the gaps that surfaced in this incident:

** Automate the relevant deployment process to ensure that a misconfigured node cannot be put into production

** Isolate engagement tracking in-house alerting services from API alerting services

** Create a separate Status Page component for engagement tracking
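
As an illustration of the first corrective action, the following is a minimal sketch of a deployment gate that refuses to promote a node whose configuration fails validation. It is not SparkPost's actual tooling: the `NodeConfig` fields, the validation rules, and the `promote` callback are all hypothetical and stand in for whatever the real deployment process checks.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class NodeConfig:
    # Hypothetical configuration shape, for illustration only.
    listen_port: int
    upstreams: List[str] = field(default_factory=list)


def validate_config(cfg: NodeConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not isinstance(cfg.listen_port, int) or not 1 <= cfg.listen_port <= 65535:
        errors.append(f"listen_port out of range: {cfg.listen_port!r}")
    if not cfg.upstreams:
        errors.append("no upstreams configured")
    for upstream in cfg.upstreams:
        host, sep, port = upstream.rpartition(":")
        if not sep or not host or not port.isdigit():
            errors.append(f"malformed upstream (expected host:port): {upstream!r}")
    return errors


def deploy(cfg: NodeConfig, promote: Callable[[NodeConfig], None]) -> bool:
    """Promote the node only if its configuration validates."""
    errors = validate_config(cfg)
    if errors:
        # Block the rollout; the node currently in production stays untouched.
        for problem in errors:
            print(f"config rejected: {problem}")
        return False
    promote(cfg)
    return True
```

The point of the design is that validation runs before any restart is attempted, so an invalid configuration is rejected while the existing node is still serving traffic.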

Posted May 08, 2019 - 13:00 EDT

Resolved
This incident has been resolved.
Posted May 07, 2019 - 12:05 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 07, 2019 - 11:56 EDT
Update
We are continuing to investigate this issue.
Posted May 07, 2019 - 11:55 EDT
Investigating
We are experiencing elevated API and UI error rates and increased API latency for some SparkPost and SparkPost Enterprise customers (including the Transmissions API and SMTP injection API). Please retry any request that returns a 5xx error.
Note: This issue does not impact our SparkPost Enterprise customers hosted in the US.
Posted May 07, 2019 - 11:47 EDT
This incident affected: SparkPost Application (WebUI) (SparkPost Application - EUROPE), Relay Webhooks Delivery Service (Relay Webhooks - EUROPE), Metrics API (Metrics API - EUROPE), Events API (Events API - EUROPE), Sending Domains API (Sending Domains API - EUROPE), SMTP API (SMTP API - EUROPE), and Transmissions API (Transmissions API - EUROPE).
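
The guidance above to retry any 5xx error can be sketched on the client side as a retry loop with exponential backoff and jitter. This is one possible approach, not an official SparkPost client: `send` is a hypothetical callable that performs one request and returns its HTTP status code, and the attempt count and delays are illustrative assumptions.

```python
import random
import time
from typing import Callable


def retry_on_5xx(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-5xx status or attempts run out.

    Returns the last status code observed.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if not 500 <= status <= 599:
            break  # success or a client error: retrying will not help
        # Exponential backoff (base, 2x base, 4x base, ...) plus random
        # jitter so that many clients do not retry in lockstep.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay))
        status = send()
    return status
```

Only 5xx responses are retried here; a 4xx response indicates a problem with the request itself and is returned immediately.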