Reporting data delays
Incident Report for SparkPost
Postmortem

Summary

On May 16th between 11AM and 7PM ET, and again on May 17th between 5PM and 9PM ET, data processing was delayed by up to 120 minutes. These delays affected data population for the Message Events API and Metrics API endpoints, as well as event webhook delivery. Performance of the Message Events and Metrics API endpoints was degraded during this time, which resulted in slower response times and some 5xx errors. Since the API powers the UI, users may have experienced slowness or errors, or noticed missing data, during these incidents. By the end of each incident, all data processing had caught up and no data was lost. Email delivery was not impacted by these incidents.

We apologize for the impact these incidents had on you. The following is an explanation of what happened and what we did to resolve it.

Why did this happen?

The incident on May 16th was primarily caused by performance constraints in our data processing pipeline that were exposed by organic growth. Our team focused on identifying and resolving those constraints and on making other tuning changes. In the course of this work on May 17th, a tuning operation executed on one of our database systems triggered the delays that day.

Follow Up

The team implemented several tuning changes to the system on May 17th and 18th. These changes have significantly reduced the load on the data processing pipeline and database systems. The system now performs much more efficiently than before: data loading and API and UI response times are much improved, and there have been no further data processing delays during peak times. We plan to do some additional analysis and may make further tuning changes to improve performance and reliability based on the lessons learned from this incident.

Posted May 26, 2017 - 15:21 EDT

Resolved
The final issue with webhooks is resolved, and webhook delivery should be completely caught up in a few minutes. We are sorry for the inconvenience. If you have further issues, please contact Support.
Posted May 16, 2017 - 18:17 EDT
Update
Metrics and Message Events data is now up to date. We are still working on getting webhooks caught up.
Posted May 16, 2017 - 17:42 EDT
Monitoring
Data processing is catching up. It is about 30 minutes behind, but we expect it to catch up within the next hour. We identified a problem with the data loading process that contributed to the delays. We are also reviewing and tuning a number of other system settings so we can avoid a repeat of this scenario.
Posted May 16, 2017 - 17:21 EDT
Update
We identified that event webhooks may be delayed as well.
Posted May 16, 2017 - 14:24 EDT
Investigating
There are delays in populating data for the reporting UI, metrics API endpoint, and message events API endpoint. Our operations team is investigating.
Posted May 16, 2017 - 13:54 EDT
This incident affected: Metrics API (Metrics API - USA) and Events API (Events API - USA).