Tuesday, November 19 11:40 - 13:30 EST
We experienced an increased error rate for jobs in the us-west-1, eu-central-1, and us-east-1 datacenters.
Back pressure from a third-party provider delayed requests and caused them to build up within the internal service responsible for starting new jobs. This buildup eventually drove the service into a crash loop.
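The failure mode above can be sketched in miniature: a service that calls a slow upstream synchronously accumulates pending requests faster than it drains them. The sketch below is illustrative only; the names (`PENDING`, `slow_provider_call`, `submit`) and the bounded-queue capacity are assumptions, not our actual implementation. A bounded queue at least makes the pressure visible and sheds load, instead of letting the backlog grow until the process dies.

```python
import queue
import threading
import time

# Hypothetical sketch: a job-start service calling a slow upstream
# synchronously. Submissions arrive faster than the worker drains them,
# which is the request buildup described above.
PENDING = queue.Queue(maxsize=5)  # assumed capacity, for illustration

def slow_provider_call(job):
    # Stand-in for the third-party call under back pressure.
    time.sleep(0.05)
    return f"started:{job}"

def submit(job):
    """Accept a job, or shed load when the backlog is full."""
    try:
        PENDING.put_nowait(job)
        return "accepted"
    except queue.Full:
        return "rejected"  # fail fast rather than crash-loop later

def worker():
    while True:
        job = PENDING.get()
        if job is None:
            break
        slow_provider_call(job)
        PENDING.task_done()

threading.Thread(target=worker, daemon=True).start()

# A burst of 20 submissions against a slow drain: most are shed.
results = [submit(f"job-{i}") for i in range(20)]
print(results.count("accepted"), results.count("rejected"))
```

Without the `maxsize` bound, every burst request would be held in memory while waiting on the provider, which is how a transient upstream delay becomes an internal outage.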
Calls to the third-party provider endpoints experiencing back pressure were disabled and rerouted to endpoints that were not experiencing delays.
We’ve modified our services to route traffic to our third-party provider via a more resilient path. We’ve also changed our behaviour to place messages on a queue without waiting for a response, eliminating the chance that back pressure from the provider overwhelms our internal services. In addition, we are adding monitoring and alerting for these subsystems.
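The fire-and-forget change described above can be sketched as follows. This is a minimal illustration, not our production code; the names (`OUTBOX`, `start_job`, `forwarder`, `send_to_provider`) are hypothetical. The key property is that the caller enqueues and returns immediately, so a slow provider can no longer stall the calling service.

```python
import queue
import threading

# Hypothetical sketch of the fire-and-forget behaviour: the caller
# drops a message on a queue and returns without waiting for the
# provider's response.
OUTBOX = queue.Queue()

def send_to_provider(message):
    # Stand-in for the real third-party call; it may be slow or fail,
    # but that no longer blocks the caller.
    return True

def start_job(job_id):
    """Enqueue the job message and return immediately."""
    OUTBOX.put({"job_id": job_id})
    return "queued"  # the caller never waits on the provider

def forwarder():
    # Background consumer: drains the queue at whatever rate the
    # provider can sustain, so back pressure stays contained here.
    while True:
        msg = OUTBOX.get()
        if msg is None:
            break
        send_to_provider(msg)
        OUTBOX.task_done()

threading.Thread(target=forwarder, daemon=True).start()

statuses = [start_job(i) for i in range(3)]
print(statuses)  # → ['queued', 'queued', 'queued']
```

The trade-off of this design is that delivery becomes asynchronous: failures in the forwarder must be surfaced through the added monitoring and alerting rather than through a synchronous error response.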