Major Outage
Incident Report for Clerk.io
Postmortem

Yesterday evening, Clerk.io experienced the biggest incident in our company’s history: a full outage lasting from 18:17 to 21:28 Central European Summer Time (CEST). In this postmortem, I will go over what happened, what we did, and how we will prevent this from happening again.

WHAT HAPPENED

At 18:17 CEST our team received a general alert, meaning that the core service had stopped responding to incoming requests.

We immediately started investigating, and by 18:19 we had identified the problem as a series of deadlocked processes in our datastore.

By 18:24 we had traced the deadlocks to corruption of the datastore and determined that a full repair was needed.

WHAT WE DID

Since uptime and stability are among our primary features, we work very hard to prevent outages, but we have also developed a series of plans and automated tools for every scenario we can think of.

Data corruption is a scenario we have made many preparations for, so the next step was to follow the plan and start our automated repair tool.

So we did the following:

  1. Shut down all services immediately to prevent further corruption.
  2. Started the automated repair software.
  3. Kept communication clear and direct, and updated status.clerk.io with the progress roughly every 10-15 minutes or when there was relevant news.
  4. Ran a deep scan after the repair to confirm that everything was in order.
  5. Ran a stress test on the whole system.
  6. Started the individual services one by one and, lastly, opened up for incoming API requests again.

Everything went exactly as planned since this was a well-rehearsed scenario.

Most of the time went to the repair (2 hours) and the deep scan (1 hour), both of which were fully automated processes run by our software.

WHAT WE DO NOW

The most important thing with outages is learning how to prevent the same mistake from happening again.

The main stress on the system, which led to the datastore corruption, was caused by a single new client that on their own accounted for roughly 20% of all our traffic.

Even our biggest brands take up less than 1% of our combined traffic, so this much volume on a single store instance was too much. In order to maintain the stability of our service, we have taken them off the service for now.

This is what we have learned and what we will do to increase our stability in the future:

  • Never take on a new client if they take up more than 1% of our combined traffic.
  • We have rearranged our development roadmap to focus more on stability and speed of recovery rather than new features for the coming months.
  • We will keep planning and practicing for these kinds of emergencies - that really pays off in a real crisis situation.
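
To make the first rule concrete, here is a minimal sketch of how such a check could look before onboarding a client. It is only an illustration: the function names, the constant and the request counts are hypothetical and are not taken from our actual systems. Only the 1% cap and the roughly 20% share come from this report.

```python
# Hypothetical sketch of the 1% rule. None of these names exist in Clerk.io's code.
MAX_TRAFFIC_SHARE = 0.01  # the 1% cap from the first rule in the list above


def projected_share(new_client_requests: float, current_total_requests: float) -> float:
    """Share of combined traffic the new client would have once onboarded."""
    combined = current_total_requests + new_client_requests
    if combined <= 0:
        raise ValueError("combined traffic must be positive")
    return new_client_requests / combined


def can_onboard(new_client_requests: float, current_total_requests: float) -> bool:
    """True if the new client would stay at or below the cap."""
    return projected_share(new_client_requests, current_total_requests) <= MAX_TRAFFIC_SHARE
```

With made-up numbers: a client sending 25 requests for every 100 we already serve would end up at 25 / 125 = 20% of combined traffic, so can_onboard(25, 100) is False. A client sending 1 request for every 1000 stays well under the cap, so can_onboard(1, 1000) is True.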

 

PS. THANK YOU :-)

This has been the single biggest crisis in our company’s history and one of the most horrifying things I could personally imagine.

You have all been extremely understanding and encouraging throughout the outage, which means the world when something like this happens.

This was not a good situation for anyone, but I’m really thankful for your understanding. It testifies not only to the great job our support team does every day, but also to all the great customers we have and the people we work with.

For that - I thank you from the bottom of my heart :-)

If you have any questions feel free to contact me directly at hkb@clerk.io

Hans-Kristian

Founder of Clerk.io

Posted over 1 year ago. Jul 05, 2017 - 12:44 CEST

Resolved
We have now been monitoring the system closely for 12 hours and everything has been running smoothly without interruptions or signs of problems.
Posted over 1 year ago. Jul 05, 2017 - 11:07 CEST
Monitoring
WE ARE BACK!! :D

All systems are operational again.

We will be monitoring everything closely. This was the worst outage we have ever experienced but all our emergency plans worked like clockwork.

We will follow up tomorrow with an in-depth postmortem.

We know that this cannot happen again and that this is a major issue for your business. Despite that, we would like to thank you for all the encouragement you gave us during these last 3 dreadful hours!
Posted over 1 year ago. Jul 04, 2017 - 21:45 CEST
Update
We are now slowly starting the service up :-)
Posted over 1 year ago. Jul 04, 2017 - 21:28 CEST
Update
The last checks take a bit longer than the others but we are 95% done and have begun preparations to start the service.
Posted over 1 year ago. Jul 04, 2017 - 21:14 CEST
Update
Checks are almost done and everything looks good. We will soon start to power up the service.
Posted over 1 year ago. Jul 04, 2017 - 20:57 CEST
Update
Checks are 70% done.
Posted over 1 year ago. Jul 04, 2017 - 20:47 CEST
Update
Checks are 50% done.
Posted over 1 year ago. Jul 04, 2017 - 20:40 CEST
Update
Checks are 30% done.
Posted over 1 year ago. Jul 04, 2017 - 20:27 CEST
Update
Now we are running the final checks.
Posted over 1 year ago. Jul 04, 2017 - 20:22 CEST
Update
The final repairs are the "heavy" damages so they take a bit longer than anticipated but are being restored correctly.
Posted over 1 year ago. Jul 04, 2017 - 20:17 CEST
Update
We found some final fixes we have to make before checking again. They are being processed now.
Posted over 1 year ago. Jul 04, 2017 - 20:07 CEST
Update
We are now running the final checks and fixes.
Posted over 1 year ago. Jul 04, 2017 - 20:05 CEST
Update
The repair is almost complete. After that, there is only a final check, and if everything goes well we can open up the service again.
Posted over 1 year ago. Jul 04, 2017 - 19:57 CEST
Update
The repair is still running but is nearing completion. After this, we will run an extra check of everything before opening up the service again.
Posted over 1 year ago. Jul 04, 2017 - 19:35 CEST
Update
The repair is now more than halfway done. After the repair, we will run our diagnostics suite to verify that everything is ok before opening up for incoming requests.
Posted over 1 year ago. Jul 04, 2017 - 19:20 CEST
Update
The repair is still progressing as planned.
Posted over 1 year ago. Jul 04, 2017 - 19:07 CEST
Update
The repair is running as planned. Now we can only wait until it is complete. We will update here periodically.
Posted over 1 year ago. Jul 04, 2017 - 18:51 CEST
Identified
There is disk corruption in our datastore. We are running a full repair now and will update here every 5 minutes.
Posted over 1 year ago. Jul 04, 2017 - 18:39 CEST
Investigating
We are experiencing a major outage and are looking into it. We will be back with updates shortly.
Posted over 1 year ago. Jul 04, 2017 - 18:35 CEST