Yesterday we had two related incidents that affected the availability of Clerk.io.
This means that stores running on Clerk.io had two very different experiences. The vast majority only experienced intermittent slowdowns between 16:25 and 19:10, and for most of that window everything was operating normally.
Stores located on Storage Zone 3, however, experienced a total outage from 17:20 to 21:20.
Below is a detailed description of what happened and what we learned.
Around 16:00 we started to receive a series of heavy queries. By heavy queries, we mean queries that need to compare a lot of internal metrics and thus take up a lot of internal resources. This is completely normal, and our operations team monitors these daily.
But at 16:25 we saw a sharp increase in volume, and all API servers instantly went to 100% CPU usage. This is what caused the initial slowdown.
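To illustrate the failure mode, here is a minimal load-shedding sketch in Python. It is purely illustrative: the cost threshold, the idea of a per-query cost estimate, and all names are assumptions, not Clerk.io's actual mechanism.

```python
import os

# Hypothetical thresholds, not Clerk.io's actual values.
HEAVY_QUERY_COST = 1_000   # internal "cost units" per query
CPU_SATURATED = 0.95       # fraction of CPU considered saturated

def cpu_utilisation() -> float:
    """Rough CPU proxy: 1-minute load average normalised by core count."""
    return os.getloadavg()[0] / os.cpu_count()

def should_shed(query_cost: float) -> bool:
    """Reject heavy queries while the API servers are saturated, so cheap
    queries keep flowing and a spike causes a slowdown rather than an outage."""
    return query_cost >= HEAVY_QUERY_COST and cpu_utilisation() >= CPU_SATURATED
```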
Our operations team then started narrowing down the source of these requests. But while they focused on this issue, another issue grew and was overlooked: the RAM usage of the server running Storage Zone 3 rose drastically. At 17:20 it ran out of RAM and crashed.
In our architecture, individual stores are placed in individual zones exactly for this case: if something goes wrong in a single zone, we can isolate it and keep the rest of Clerk.io running smoothly.
And that was exactly what we did. As soon as zone 3 crashed, we knew the source of the problem was there. We immediately followed our playbook for this scenario, closing that zone down and bringing all other zones back to normal operation with normal response times.
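As a rough illustration of this isolation model, here is a minimal sketch of how a zone can be fenced off without touching the others. The `ZoneRouter` class and all names in it are hypothetical, not our actual routing layer.

```python
from dataclasses import dataclass, field

@dataclass
class ZoneRouter:
    """Maps each store to its storage zone; a whole zone can be
    fenced off while every other zone keeps serving traffic."""
    store_zone: dict                            # store id -> zone number
    closed_zones: set = field(default_factory=set)

    def close_zone(self, zone: int) -> None:
        self.closed_zones.add(zone)

    def is_available(self, store_id: str) -> bool:
        return self.store_zone[store_id] not in self.closed_zones

router = ZoneRouter(store_zone={"store-a": 1, "store-b": 3})
router.close_zone(3)                     # fence off the crashed zone
assert router.is_available("store-a")    # all other zones keep serving
assert not router.is_available("store-b")
```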
So at 17:30, we had zone 3 down but all other zones running smoothly again.
Now, when a database crashes like this, you will get data corruption. We are prepared for this scenario and immediately ran our automated software for detecting and fixing any issues. This took around 1.5 hours to complete, so around 18:50 we were ready to open up zone 3 again.
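For illustration, assuming a MySQL-style storage layer (this write-up does not name the actual engine or tooling), such an automated check-and-repair pass could be as simple as:

```python
import subprocess

def check_and_repair_zone(db_host: str) -> None:
    """One automated pass over every table: check each one and repair
    any that are corrupted. On a large zone this can take hours."""
    subprocess.run(
        ["mysqlcheck", "--all-databases", "--auto-repair", "--host", db_host],
        check=True,
    )
```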
Unfortunately, some tables that seemed fine to the automated tools still did not work properly once back in production, causing yet another slowdown in response times.
We quickly took zone 3 out again, restoring normal operations for all other zones. We then started two concurrent plans: we would run the automated check-and-repair software again while manually checking all tables, and in parallel we would spin up a completely new server, restoring zone 3 from our backup. Whichever completed first would be launched.
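The "race both recovery paths" pattern can be sketched like this; the two functions below are stand-ins with fake durations, not our real recovery jobs.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def repair_in_place() -> str:
    time.sleep(2)    # stands in for the ~1.5h automated + manual check
    return "repaired in place"

def restore_from_backup() -> str:
    time.sleep(3)    # stands in for provisioning and restoring a new server
    return "restored from backup"

# Race both recovery paths and act on whichever completes first.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(repair_in_place), pool.submit(restore_from_backup)]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    print(done.pop().result())
```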
Due to the data size in the zone, this again took around 1.5 hours to complete. The check and repair finished first, and we ran a full manual check on all tables, fixing the 2-3 errors we found.
While waiting, we built a small feature allowing our operations team to take a single store out of the system instead of a full zone. This meant that the second time around we could open up stores in small batches, continuously checking that everything was running smoothly.
We started opening up at 21:00 and had opened up 90% of all stores in zone 3 by 21:15. When opening up the remaining 10%, we started seeing the same issues again and quickly closed the last batch. We now knew the root cause was in that last set of stores.
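Combining the per-store switch with the batched rollout, the procedure looks roughly like this minimal sketch, where the `enable`, `disable`, and `zone_is_healthy` hooks are hypothetical ops functions rather than real Clerk.io APIs:

```python
import time

def reopen_in_batches(stores, enable, disable, zone_is_healthy,
                      batch_size=25, settle_seconds=60):
    """Re-enable stores a small batch at a time, watching zone health
    after each batch; if problems reappear, close only that batch."""
    for i in range(0, len(stores), batch_size):
        batch = stores[i:i + batch_size]
        for store in batch:
            enable(store)
        time.sleep(settle_seconds)   # let response-time metrics settle
        if not zone_is_healthy():
            for store in batch:
                disable(store)       # take only the bad batch out again
            return batch             # hand these to ops for manual review
    return []                        # everything reopened cleanly
```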
By examining those stores one by one, we had all but one up and running by 21:20.
The remaining store was moved to an isolated environment and was fully operational by 22:30.