On May 23, 2019, a monitoring agent was updated on the US and EU production environments outside of our standard Change Management process. The update was not suitable for production and caused several essential customer-facing services to go offline, including the injection APIs, outbound message delivery, and open/click redirection. Understanding the issue, determining a mitigation plan, and fully restoring all services for all customers took several hours.
Impact Period for EU: Thursday, May 23, 2019, 21:14 - 23:59 UTC
Impact Period for US: Thursday, May 23, 2019, 21:22 - 23:14 UTC
During the impact period:
– Transmissions API and other APIs were only intermittently available (customers received 5xx errors and timeouts)
– SMTP Injection API was only intermittently available (customers received 4xx errors and timeouts)
– Outbound message delivery was delayed (customers observed slow message delivery)
– Users observed intermittent errors in the UI while accessing reports and other features
– Opens were not recorded consistently
– Clicks were not redirecting recipients consistently
A new version of a monitoring agent was obtained from SparkPost’s vendor and uploaded into a package repository that was incorrectly configured to automatically update production systems. The monitoring agent had a performance problem that caused a large number of SparkPost servers to become unresponsive. The SparkPost Operations team was not aware that the update had gone directly to production, and it took almost an hour to diagnose the situation. In addition, each affected server had to be manually rebooted and the monitoring agent disabled; this process took a further two hours before all services in both regions were fully recovered.
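The report does not name the package tooling involved, so as an illustration only: on a typical yum-based fleet, repository-driven unattended updates are controlled by the yum-cron configuration, and the failure mode described above corresponds to updates being applied automatically. A minimal sketch of the safe setting (assuming yum-cron; the file path and host setup are our assumption, not SparkPost's):

```ini
# /etc/yum/yum-cron.conf -- illustrative sketch, assuming a yum-based host
[commands]
# Notify about available updates, but never download or apply them
# automatically; production changes go through Change Management instead.
update_messages = yes
download_updates = no
apply_updates = no
```

With `apply_updates = no`, a package pushed into the repository sits there until an operator deploys it through the normal change process, rather than rolling out to production on the next sync.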
Our customers depend on SparkPost’s services for their critical business operations and we know that this type of incident can severely impact marketing and transactional email programs. We are committed to learning from this incident by making improvements to prevent this from happening again and by reducing the time it takes to restore services. We have identified the following corrective actions:
– Centralize production package management activities into a single team within the SparkPost Operations department. [COMPLETE]
– Audit all package repositories to confirm that automatic syncing to production systems is disabled. [COMPLETE]
– Add safeguards so that package updates are never deployed to production without going through the standard Change Management process. [IN PROGRESS]
– Evaluate structural changes to the Security and Engineering teams to facilitate better communication and coordination during incidents. [IN PROGRESS]