Emergency Database Migration

Incident Report for Clerk.io

Postmortem

tl;dr: Last night we experienced what can only be described as a full meltdown at our core hosting provider.

After fighting several independent issues on our hosting provider’s end, we decided to restore from our previous nightly backup on a new set of servers.

The full restore took less than 2 hours, but the total downtime was 9 hours - longer than all previous downtimes combined!

This is totally unacceptable and we will immediately migrate our whole service away from this hosting provider.

Full Postmortem

Last night, we experienced the worst outage Clerk.io has ever had. I cannot express how truly sorry I am that this happened. Even a few minutes of downtime is not acceptable; hours of it is unthinkable. That being said, our IT team put in a fantastic effort and was on top of the situation from start to finish.

It started with our IT team receiving automated alerts at 23:53 CET yesterday. They quickly identified the main database as the cause of the errors.

But just 10 minutes after receiving the initial alert, our hosting provider went completely down due to a DDoS attack, cutting us off from both the servers and the backend that allows us to manage them.

At 02:46 CET we were again able to access the servers and the hosting backend systems. But only moments later, we received an automated message that the hardware on the physical host was damaged and that our hosting provider had started migrating the disks to another host.

The migration was completed at 03:36 CET, at which point we could start working on the server, and by 03:44 CET we thought we had solved the problem. We opened access to our API and declared the problem solved.

But we quickly realised that something was completely wrong. The service worked, but it would sometimes hang, and many of our real-time features were not working properly.

After an hour of debugging our system, we found that the physical disk was damaged and was therefore returning corrupt data for some sectors. At that point we decided to shut the service down again and simply restore everything from our nightly backup on a fresh set of servers.

This process was started at 07:17 CET, and everything went smoothly as we followed our restore procedure.

The servers were set up and the restore was completed by 09:17 CET, and after checking everything we started the service at 09:24 CET.

We will (after the team has slept) immediately begin to move everything we have away from this hosting provider to an infrastructure where the physical hardware is completely abstracted away. This was the plan for Q1 2016, but we decided to do it now so our IT team can sleep comfortably again.

Again, I cannot express how sorry I am. We provide an extremely business-critical service, where uptime and trust are of the utmost importance.

I would like to thank the customers we have been in contact with for their understanding and encouragement, and wish everybody a happy New Year.

If there is anything we didn't answer, please don't hesitate to contact us at support@clerk.io - we are always here to help you.

Hans-Kristian Bjerregaard, Founder & CEO, Clerk.io

Posted Dec 30, 2015 - 11:17 CET

Resolved

We are back up again!

We will do some checks to make sure everything is running smoothly.

Posted Dec 30, 2015 - 09:24 CET

Update

We are now performing manual checks...

Posted Dec 30, 2015 - 09:17 CET

Update

Less than 10% left of the data migration.

After the migration is complete we will make a quick manual check and then open up the system.

Posted Dec 30, 2015 - 09:02 CET

Update

Data migration is now more than 50% complete.

After the data migration we will perform a quick systems check before starting all services.

We expect to be up and running fully between 9:15-9:30 CET.

Posted Dec 30, 2015 - 08:36 CET

Monitoring

We are expecting the migration to be completed before 9:00 CET.

Posted Dec 30, 2015 - 08:12 CET

Identified

We have to move all our data to a new service now. This will take 45-60 minutes.

The service will not be available while we make this shift.

Posted Dec 30, 2015 - 08:06 CET