Event Details:
On May 19, 2020, between 14:15 UTC and 15:35 UTC, certain subsystems of Area 1 experienced sporadic mail delays.
Root Cause & Actions:
The delays were caused by our image analysis system failing to terminate properly after processing. The issue affected roughly 15% of our email processing infrastructure; all other nodes were unaffected. The team was alerted to increased queue sizes on the affected processing nodes and took immediate action to resolve the issue. About 6.6% of all email messages delivered to the Area 1 system during the incident were delayed by more than 1 minute, and no message took longer than 14 minutes to process. All other messages were processed within an acceptable time range. All messages were successfully queued, so no messages were lost during the event.
Additionally, the Area 1 engineering team has developed and deployed a solution to ensure that our image analysis system does not cause message processing delays in the future. We have also adjusted our processes so that customers are alerted while an event is occurring rather than after it.