2020-November-19 Service Incident
Postmortem

Dates:

Thursday, November 19, 2020, 11:40 - 13:30 EST

What happened:

We experienced an increased error rate for jobs in the us-west-1, eu-central-1 and us-east-1 datacenters.

Why it happened:

Back pressure from a third-party provider caused a delay and a build-up of requests within the internal service responsible for starting new jobs. This delay caused a crash loop within our service.
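
For illustration only, here is a simplified sketch of that failure mode. It is not our actual service code; the names, timings, and queue sizes are hypothetical.

    # Hypothetical sketch: a synchronous worker blocked on a slow provider
    # lets its backlog of pending job starts grow without bound.
    import queue
    import time

    start_job_requests = queue.Queue()   # unbounded backlog of pending job starts

    def call_provider(request):
        # Synchronous call to the third-party provider. Under back pressure
        # the provider responds slowly, so each call holds the worker for a
        # long time (simulated here with a sleep).
        time.sleep(25)
        return "ok"

    def worker():
        while True:
            request = start_job_requests.get()
            # While we wait here, new requests keep arriving and the queue
            # grows. Once memory or watchdog limits are hit, the process
            # crashes, restarts, and immediately falls behind again: a
            # crash loop.
            call_provider(request)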

How we fixed it:

Calls to the third-party provider endpoints experiencing back pressure were disabled and rerouted to different endpoints that weren't experiencing delays.

What we are doing to prevent it from happening again:

We’ve modified our services to route traffic to our third-party provider via a more resilient path. We’ve also changed our behaviour to drop messages onto a queue without waiting for a response, eliminating the chance that back pressure overwhelms our internal services. In addition, we will add monitoring and alerting for these subsystems.
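
For illustration only, here is a simplified sketch of the "drop the message on a queue without waiting" change. It assumes an internal message queue; the names are hypothetical and this is not our real interface.

    # Hypothetical sketch: enqueue the job-start message and return
    # immediately instead of blocking on the provider's reply.
    import queue

    outbound_queue = queue.Queue(maxsize=10_000)  # bounded, so overload fails fast

    def enqueue_job_start(request):
        try:
            # Fire-and-forget: no waiting on the third-party provider here.
            # A separate consumer drains the queue at whatever rate the
            # provider can sustain.
            outbound_queue.put_nowait(request)
            return "accepted"
        except queue.Full:
            # If the queue is full, reject immediately rather than letting
            # requests pile up inside the service that starts jobs.
            return "rejected"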

Posted Nov 24, 2020 - 12:42 EST

Resolved
Error rates have subsided. All services are fully operational.
Posted Nov 19, 2020 - 14:39 EST
Monitoring
We have taken remedial action and are seeing improvements. We are monitoring.
Posted Nov 19, 2020 - 13:45 EST
Investigating
We are seeing a high error rate on automated headless tests. We are investigating.
Posted Nov 19, 2020 - 13:02 EST