Last evening, Clerk.io experienced the biggest incident in our company’s history, with a full outage lasting from 18:17 to 21:28 Central European Time (CET). In this postmortem, I will go over what happened, what we did and how we will prevent this from happening again.
At 18:17 CET our team received a general alert, meaning that the core service had stopped responding to incoming requests.
We immediately started investigating, and by 18:19 we had identified the problem as a series of deadlocked processes in our datastore.
By 18:24 we had traced the cause to corruption of our datastore and determined that a full repair was needed.
Since uptime and stability are among our primary features, we work very hard to prevent outages, but we have also developed a series of plans and automated tools for every scenario we can think of.
Data corruption is a scenario we have made many preparations for, so the next step was to follow the plan and start our automated repair tool.
So we did the following:
Everything went exactly as planned since this was a well-rehearsed scenario.
The majority of the time was used for the repair (2 hours) and the deep scan (1 hour) and both were 100% automated processes, run by our software.
The most important thing with outages is learning how to prevent the same mistake from happening again.
The main stress on the system that led to the datastore corruption was caused by a single new client that, on their own, accounted for roughly 20% of all our traffic.
Even our biggest brands each account for less than 1% of our combined traffic, so this much volume on a single store instance was more than it could handle. To maintain the stability of our service, we have taken them off the service for now.
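To make the imbalance described above concrete, here is a minimal sketch of how a per-client share of traffic could be computed and checked against a limit. It is an illustration only, not our actual tooling: the function names, the list-of-client-IDs input, and the 5% threshold are all assumptions made for the example.

```python
from collections import Counter


def traffic_shares(request_client_ids):
    """Return each client's fraction of all requests (hypothetical helper)."""
    counts = Counter(request_client_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {client: n / total for client, n in counts.items()}


def clients_over_limit(request_client_ids, max_share=0.05):
    """List clients whose share exceeds max_share, largest first.

    The 5% default is an assumed value chosen for illustration.
    """
    shares = traffic_shares(request_client_ids)
    over = [client for client, share in shares.items() if share > max_share]
    return sorted(over, key=lambda client: shares[client], reverse=True)


# Example: one client sends 20 of 100 requests, 80 others send 1 each.
sample = ["new_client"] * 20 + [f"store_{i}" for i in range(80)]
flagged = clients_over_limit(sample)
```

In this sample, the one client at 20% is flagged while the stores at 1% each are not, which mirrors the proportions in this incident.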
This is what we have learned and what we will do to increase our stability in the future:
This has been the single biggest crisis in our company’s history and one of the most horrifying things I could personally imagine.
You have all been extremely understanding and encouraging throughout the outage, which means the world when something like this happens.
This was not a good situation for anyone, but I’m really thankful for your understanding. It testifies not only to the great job our support team does every day, but also to the great customers we have and the people we work with.
For that - I thank you from the bottom of my heart :-)
If you have any questions feel free to contact me directly at firstname.lastname@example.org
Founder of Clerk.io