On May 23, 2019, a monitoring agent was updated on the US and EU production environments outside of our standard Change Management process. The update was not suitable for production and caused several essential customer-facing services to go offline, including the injection APIs, outbound message delivery, and open/click redirection. Understanding the issue, determining a mitigation plan, and fully restoring all services for all customers took several hours.
Impact Period for EU: Thursday, May 23, 2019, 21:14 - 23:59 UTC
Impact Period for US: Thursday, May 23, 2019, 21:22 - 23:14 UTC
During the impact period:
– Transmissions API and other APIs were only intermittently available (customers received 5xx errors and timeouts)
– SMTP Injection API was only intermittently available (customers received 4xx errors and timeouts)
– Outbound message delivery was delayed (customers observed slow message delivery)
– Users observed intermittent errors in the UI while accessing reports and other features
– Opens were not recorded consistently
– Clicks were not redirecting recipients consistently
A new version of a monitoring agent was obtained from SparkPost’s vendor and uploaded into a package repository that was incorrectly configured to automatically update production systems. The monitoring agent had a performance problem that caused a large number of SparkPost servers to become unresponsive. The SparkPost Operations team was not aware that the update had gone directly to production, and it took almost an hour to diagnose the situation. In addition, each affected server had to be manually rebooted and the monitoring agent disabled; this process took a further two hours before all services in both regions were fully recovered.
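The report does not name the package tooling involved, so as an illustration only: on a typical yum-based fleet, repository-driven unattended updates are controlled by the yum-cron configuration, and the failure mode described above corresponds to updates being applied automatically. A minimal sketch of the safe setting (assuming yum-cron; the file path and host setup are our assumption, not SparkPost's):

```ini
# /etc/yum/yum-cron.conf -- illustrative sketch, assuming a yum-based host
[commands]
# Notify about available updates, but never download or apply them
# automatically; production changes go through Change Management instead.
update_messages = yes
download_updates = no
apply_updates = no
```

With `apply_updates = no`, a package pushed into the repository sits there until an operator deploys it through the normal change process, rather than rolling out to production on the next sync.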
Our customers depend on SparkPost’s services for their critical business operations and we know that this type of incident can severely impact marketing and transactional email programs. We are committed to learning from this incident by making improvements to prevent this from happening again and by reducing the time it takes to restore services. We have identified the following corrective actions:
– Centralize production package management activities into a single team within the SparkPost Operations department. [COMPLETE]
– Audit all package repositories to confirm that automatic syncing to production systems is disabled. [COMPLETE]
– Add safeguards so that package updates are never deployed to production without going through the standard Change Management process. [IN PROGRESS]
– Evaluate structural changes to the Security and Engineering teams to facilitate better communication and coordination during incidents. [IN PROGRESS]