2020-October-14
Postmortem

Dates

Wednesday, October 14, 2020 7:00 pm ET - Thursday, October 15, 2020 1:40 am ET

What happened:

Customers experienced elevated wait times and an increased rate of tests terminating unexpectedly from approximately 7:00 pm to 11:00 pm.

Why it happened:

Our GKE cluster underwent automated upgrades during the incident window, which caused services to be terminated unexpectedly.

How we fixed it:

The automated cluster upgrade completed, and service returned to normal operation.

What we are doing to prevent it from happening again:

We’re adjusting maintenance windows to reduce customer impact. We’re also investigating ways to make the system more robust to cluster interruptions.

Posted Nov 09, 2020 - 11:26 EST

Resolved
The issue is resolved. Headless (us-east-1) testing should be back to normal.
Posted Oct 15, 2020 - 00:58 EDT
Monitoring
Headless tests (us-east-1) have a small chance of erroring while starting. We've identified the issue and are monitoring, but errors may still occur.
Posted Oct 14, 2020 - 21:42 EDT