On May 16th between 11AM and 7PM ET, and again on May 17th between 5PM and 9PM ET, data processing was delayed by up to 120 minutes. These delays affected data population for the Message Events API and Metrics API endpoints as well as event webhook delivery. Message Events and Metrics API endpoint performance was degraded during this time, resulting in slower response times and some 5xx errors. Because these APIs power the UI, users may have experienced slowness, errors, or missing data during these incidents. By the end of each incident, all data processing had caught up and no data was lost. Email delivery was not impacted by these incidents.
We apologize for the impact these incidents had on you. The following is an explanation of what happened and what we did to resolve it.
The incident on May 16th was primarily due to performance constraints in our data processing pipeline triggered by organic growth. Our team focused on identifying and resolving those constraints and making other tuning changes. In the course of that work on May 17th, a tuning operation executed on one of our database systems was what triggered the delays that day.
The team implemented several tuning changes to the system on May 17th and 18th. These changes significantly reduced the load on the data processing pipeline and database systems. The system now runs much more efficiently than before: data loading and API and UI response times are much improved, with no further data processing delays during peak times. We plan to do additional analysis and may make further tuning changes to improve performance and reliability based on the lessons learned from this incident.