Elevated 5xx API error rates and increased latency for some SparkPost and SparkPost Enterprise customers (EU hosted only)
Incident Report for SparkPost
Postmortem

Summary:

On May 7 2019, during routine operational maintenance on SparkPost EU, the central API service tier experienced an outage. This prevented many APIs from working for 15 minutes: Transmissions, Tracking Domains, Sending Domains, Templates, Inbound Domains and Relay Webhook APIs. Additionally, the engagement tracking service failed to record opens and clicks and click-through redirects didn’t work -- meaning that links did not work for recipients who clicked a link in an email during this incident. The Status Page incident did not explicitly state that engagement tracking was impacted.

Impact Period: Tuesday 7 May 2019 15:40 - 15:55 UTC

Root Cause:

A key software component that handles incoming HTTP traffic for APIs and engagement tracking failed to restart because of an invalid configuration. This caused the APIs to return 502 errors to any HTTP requests, including link redirects. Additionally, our alerting framework made it difficult to identify all the impacted components - especially engagement tracking.

Corrective Actions:

At SparkPost we take these kinds of incidents very seriously. We know that our email service is critical for our customers and we know that it’s imperative that we give full, accurate and timely information regarding operational incidents. To that end, we have identified a number of corrective actions to address the gaps that surfaced in this incident:

** Automate the relevant deployment process to ensure that a misconfigured node cannot be put into production

** Isolate engagement tracking inhouse alerting services from API alerting services

** Create a separate Status Page component for engagement tracking

Posted about 2 months ago. May 08, 2019 - 13:00 EDT

Resolved
This incident has been resolved.
Posted about 2 months ago. May 07, 2019 - 12:05 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted about 2 months ago. May 07, 2019 - 11:56 EDT
Update
We are continuing to investigate this issue.
Posted about 2 months ago. May 07, 2019 - 11:55 EDT
Investigating
We are experiencing an elevated level of API and UI errors and latency for our APIs for some SparkPost and SparkPost Enterprise customers (including Transmissions API and SMTP injection API). Please retry any 5xx error.
Note: This issue does not impact our SparkPost Enterprise customers hosted in the US.
Posted about 2 months ago. May 07, 2019 - 11:47 EDT
This incident affected: IP Pools API (IP Pools API - EUROPE), SparkPost Application (WebUI) (SparkPost Application - EUROPE), Transmissions API (Transmissions API - EUROPE), Message Events API (Message Events API - EUROPE), Tracking Domains API (Tracking Domains API - EUROPE), Relay Webhooks Delivery Service (Relay Webhooks - EUROPE), Sending IPs API (Sending IPs API - EUROPE), Sending Domains API (Sending Domains API - EUROPE), Metrics API (Metrics API - EUROPE), SMTP API (SMTP API - EUROPE), and Templates API (Templates API - EUROPE).