We have now investigated this outage and have the full picture.
The root cause was a power failure on a physical machine hosting one of our virtual database servers on AWS. This corrupted data and triggered a domino effect that effectively took out all of our API servers.
By relying on our data zoning framework, our operations team was able to bring 92% of all stores back online within just 6 minutes.
The remaining stores in the affected zone needed to be checked before being brought back online - the time this took depended on the amount of data in each store and the complexity of the data loss. The last store was back online after 3 hours and 40 minutes.
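To illustrate why only 8% of stores were affected, here is a minimal sketch of zone-based partitioning. All names (`assign_zone`, `affected_stores`, `NUM_ZONES`) are hypothetical and only illustrate the idea of deterministically pinning each store's data to one zone, so a zone failure has a bounded blast radius; this is not our actual implementation.

```python
import hashlib

NUM_ZONES = 12  # illustrative zone count, not our real topology

def assign_zone(store_id: str) -> int:
    """Deterministically map a store to a zone via a stable hash."""
    digest = hashlib.sha256(store_id.encode()).hexdigest()
    return int(digest, 16) % NUM_ZONES

def affected_stores(all_stores, failed_zone):
    """Only stores whose data lives in the failed zone need maintenance mode;
    every other store keeps running untouched."""
    return [s for s in all_stores if assign_zone(s) == failed_zone]
```

The key property is that the mapping is stable: when one zone fails, operators can compute exactly which stores to isolate and leave the rest of the platform online.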
12:48 - The physical host machine for one of our databases loses power and immediately shuts down.
12:50 - The machine automatically comes back online, but some of the data that was being written during the power failure has been corrupted.
12:52 - The damaged data causes processes trying to access it to hang, creating a domino effect throughout our API servers: all processes end up waiting for data, making our API unavailable.
12:58 - Our operations team, which has been alerted throughout by our automated monitoring systems, puts all stores on the affected database into maintenance mode. This means they can no longer be accessed, but all other stores on our platform run normally, so our API is fully functional again for everyone else. Roughly 8% of all stores had data on this particular database.
12:59 - To ensure that we don't take any store with damaged data back online, we check every store on that database twice and automatically restore any damaged data. Data that could not be recovered by our automated systems is then handled manually. When a store is checked and confirmed to be working, it is brought back online.
15:35 - All stores are now operational again.
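The 12:59 step can be sketched as a simple recovery loop: check each store twice, attempt an automated restore when a check fails, and queue anything unrecoverable for manual handling. All function and field names here are hypothetical, chosen only to illustrate the flow described above, not our actual tooling.

```python
def recover_zone(stores, check, auto_restore):
    """check(store) -> True if the store's data is intact.
    auto_restore(store) -> True if automated repair succeeded.
    Returns the stores that must be handled manually."""
    manual_queue = []
    for store in stores:
        # Every store is verified twice before it may come back online.
        if check(store) and check(store):
            bring_online(store)
        # If a check fails, try the automated restore, then re-verify.
        elif auto_restore(store) and check(store):
            bring_online(store)
        else:
            manual_queue.append(store)  # left for operators to repair
    return manual_queue

def bring_online(store):
    store["online"] = True  # illustrative flag; real stores leave maintenance mode
```

Stores recovered automatically come online as soon as their re-check passes, which is why most of the affected zone returned quickly while a handful of stores took hours.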
We have confirmed that our strategy of using different zones to store data works. Even for a major event such as a power failure that causes data loss, we were able to contain the issue to a single zone, protecting all other stores from being affected.
We also confirmed that having automated 99% of the data testing and repair work meant that we could get most stores in the affected zone back online quickly.
During the outage, we learned of a smarter way to more quickly identify the few stores in a zone that are the root cause of a problem, meaning that should this happen again, we can bring the majority of a zone back online much faster.