API Slowdown + Full Storage Zone 3 Outage
Incident Report for Clerk.io

Yesterday we had two related incidents that affected the availability of Clerk.io.

  1. An API slowdown that affected all stores running on Clerk.io. Our average response time rose from 8.7ms to 200ms over half an hour. This lasted from 16:25 to 17:35 CET. On top of that, we had two instances of response times above 1 second, between 17:30-17:50 and again between 18:50-19:10.
  2. A full outage of all stores on our Storage Zone 3 that lasted from 17:15 to 21:20.

This means that stores running on Clerk.io had two very different experiences. The vast majority only experienced intermittent slowdowns between 16:25 and 19:10, and for most of that window everything was operating normally.

The stores located on Storage Zone 3 experienced a total outage from 17:20 to 21:20.

Below follows a detailed description of what happened and what we learned.

What happened?

Around 16:00 we started to receive a series of heavy queries. By heavy queries, we mean queries that need to compare a lot of internal metrics and thus take up a lot of internal resources. This is completely normal, and our operations team monitors these daily.

But at 16:25 we saw a sharp increase in volume, and all API servers instantly went to 100% CPU usage. This is what caused the initial slowdown.
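For context, a common safeguard against this kind of request flood is a per-source token bucket that sheds traffic from any source exceeding its budget. Below is a minimal sketch; all names and limits are hypothetical, and Clerk.io's actual protection system is not public:

```python
import time

class TokenBucket:
    """Per-source rate limiter: each source earns `rate` tokens/sec, up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.state = {}  # source -> (available tokens, last refill timestamp)

    def allow(self, source: str, cost: float = 1.0, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.state.get(source, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= cost:
            self.state[source] = (tokens - cost, now)
            return True   # within budget: serve the query
        self.state[source] = (tokens, now)
        return False      # over budget: reject or queue the query

# Hypothetical budget: 10 heavy queries/sec per source, bursts of up to 20.
limiter = TokenBucket(rate=10, burst=20)
```

Heavy queries could be given a higher `cost` than cheap ones, so a single source cannot saturate CPU across all API servers.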

Our operations team then started narrowing down the source of these requests. But while we were focused on this issue, another issue grew unnoticed: the RAM usage of the server running Storage Zone 3 rose drastically. At 17:20 it ran out of RAM and crashed.

In our architecture, individual stores are placed in individual zones exactly for this case. If something goes wrong in a single zone, we can isolate it and keep the rest of Clerk.io running smoothly.

And that is exactly what we did. As soon as zone 3 crashed, we knew the source of the problem was there. We immediately followed our playbook for this scenario, closing that zone down and bringing all other zones back to normal operation with normal response times.

So at 17:30, we had zone 3 down but all other zones running smoothly again.

Now, when a database crashes like this, you will get data corruption. We are prepared for this scenario and immediately ran our automated software for detecting and fixing any issues. This took around 1.5 hours to complete, so around 18:50 we were ready to open up zone 3 again.

Unfortunately, some tables that seemed fine to the automated tools still did not work properly when they came back into production, causing yet another slowdown in response times.

We quickly took zone 3 out again, restoring normal operations for all other zones. We then started two concurrent plans: we would run the check-and-repair software again while manually checking all tables, and in parallel we would spin up a completely new server, restoring zone 3 from our backup. Whichever was completed first would be launched.
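The "whichever is completed first wins" approach maps naturally onto racing two tasks and acting on the first result. A minimal Python sketch, with stand-in functions for the two recovery paths (these are illustrations, not Clerk.io's actual tooling):

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def check_and_repair_tables() -> str:
    # Stand-in for the automated check/repair pass plus manual table review.
    return "repaired-in-place"

def restore_from_backup() -> str:
    # Stand-in for provisioning a fresh server and restoring the zone backup.
    return "restored-from-backup"

def recover_zone() -> str:
    """Run both recovery plans in parallel; launch whichever finishes first."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(check_and_repair_tables),
                   pool.submit(restore_from_backup)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return done.pop().result()
```

In a real incident the losing plan would also be cancelled or kept warm as a fallback; that bookkeeping is omitted here for brevity.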

Due to the data size in the zone, it again took 1.5 hours to complete. The check and repair completed first, and we ran a full manual check on all tables, fixing the 2-3 errors we found.

While waiting, we had built a small feature allowing our operations team to take a single store out of the system instead of a full zone. This meant that the second time around, we could open up stores in small batches, continuously checking that everything was running smoothly.

We started opening up at 21:00 and had opened up 90% of all stores in zone 3 by 21:15. When opening up the remaining 10%, we started seeing the same issues again and quickly closed the last batch. We then knew the root cause was in that last set of stores.
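The batched reopening described above can be sketched as a loop that re-enables a small batch of stores, runs a health check, and holds back any batch that misbehaves. The function and the `healthy` check are illustrative assumptions, not Clerk.io's actual tooling:

```python
def reopen_in_batches(stores, healthy, batch_size=10):
    """Re-enable stores in small batches; hold back a batch if health checks fail.

    `healthy(batch)` is a stand-in for the operations team's post-batch check.
    Returns (reopened, held_back).
    """
    reopened, held_back = [], []
    for i in range(0, len(stores), batch_size):
        batch = stores[i:i + batch_size]
        if healthy(batch):          # batch came up cleanly: keep it open
            reopened.extend(batch)
        else:                       # issues reappeared: close this batch again
            held_back.extend(batch)
    return reopened, held_back
```

The held-back batch can then be examined store by store, which is essentially what narrowed the root cause down to a single store here.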

By examining those stores one by one, we had all but one up and running by 21:20.

The remaining single store was moved to an isolated environment and was fully operational at 22:30.

What we learned

  1. We learned that our zoning strategy effectively kept the vast majority of stores safe. It’s the first time it has been used in a real incident and it worked like clockwork.
  2. We also learned that our zones are too big - the amount of time it takes for even big machines to check or restore the data in a zone is simply too long. The majority of the downtime was just all of us sitting anxiously waiting for the machines to crunch data. We will immediately start reorganizing into smaller zones.
  3. Even though zones are an effective mitigation strategy, we need to be more nuanced. 99% of the stores in the zone did not need to be affected. We implemented a rough system for quickly taking individual stores within a zone in and out of maintenance mode. This will be built into our standard operating procedures for any future incidents.
  4. We have automated bots that monitor incoming requests and automatically deal with most problems before they arise. This special case was not caught by that system. Based on our learnings, we will train our bots to catch it and prevent this from happening again.
  5. We will add a special monitoring zone for all stores our operations team sees behaving strangely, so they can immediately be isolated from the rest before they cause any problems.
Posted Apr 21, 2020 - 09:38 CEST

Everything is now back online and running smoothly.

A postmortem will follow tomorrow when we have a full overview of the situation.
Posted Apr 20, 2020 - 21:54 CEST
We are opening up for customers on the last database server now.

We are taking the rollout one step at a time.
Posted Apr 20, 2020 - 21:10 CEST
We are still working on getting the final database server up and running.

In the meantime, we have started setting up a clone from the backup in parallel.

Whichever is ready first will be launched.
Posted Apr 20, 2020 - 19:43 CEST
We are still working on opening up for the stores on the last database server.

We are narrowing in on the root cause and hope to get this final server up and running again soon.
Posted Apr 20, 2020 - 19:21 CEST
We have isolated the issue.

This means that all stores not running on one particular database server are now running smoothly.

We have to do some further work on the stores on the last server before opening up, so they have temporarily been suspended while we work.
Posted Apr 20, 2020 - 18:01 CEST
We have been able to mitigate some of the incoming requests, allowing some throughput, but response times are still slow.
Posted Apr 20, 2020 - 17:39 CEST
We are continuing to monitor for any further issues.
Posted Apr 20, 2020 - 17:33 CEST
We are being flooded with incoming API requests causing all API servers to use 100% CPU.

We are working on blocking the source of the request flood.
Posted Apr 20, 2020 - 17:28 CEST
We are experiencing slower API response times at the moment.

We are monitoring the situation and working on identifying the source.
Posted Apr 20, 2020 - 17:09 CEST
This incident affected: API.