Datastore Outage
Incident Report for Clerk.io
Postmortem

Yesterday, Clerk.io experienced our second-biggest incident ever, and the second this month, with an outage lasting from 14:34 to 17:09 Central European Summer Time (CEST). I can only say that it is dreadful to sit here for the second time in a month writing a postmortem like this. But in this postmortem I will go over what happened, what we did, and how we will prevent this from happening again.

WHAT HAPPENED

At 14:34 CEST our team received a general alert, meaning that the core service had stopped responding to incoming requests.

We immediately started investigating the source, but due to the holiday season and a late lunch break, it took more than 10 minutes before any technician was near a computer that could access our service (this will be important later).

The cause was identified as corrupt data in our data store, and we initially assumed that this was the same error as earlier this month (spoiler: it wasn't).

WHAT WE DID

Since we had just experienced a similar incident (or so we thought), we followed the procedure from the last incident:

  1. Shut down all services immediately to prevent further corruption.
  2. Started the automated repair software.
  3. Kept communication clear and direct, and updated status.clerk.io with the progress roughly every 30 minutes or when there was relevant news.
  4. After the repair, we ran a deep scan to confirm that everything was in order.
  5. Ran a stress test on the whole system.
  6. Started the separate services one by one and lastly opened up for the incoming API requests again.

But when we reached step 4, we started noticing something: some data points were completely damaged and needed to be restored from a backup. This was not like last time.

But at first, we just focused on getting the service up again as fast as possible.

WHAT WE FOUND OUT

After getting the service back online yesterday, we immediately started investigating what caused the issue. The incident from last time should not have been able to happen again, since we took immediate precautions after it!

After some deep digging in our server logs, we found something "interesting". The system log indicated that the master data store had been rebooted! Or rather, we noticed that the log files contained messages for the boot sequence but none for the shutdown sequence.

This is the clear signature of someone pulling the plug on the physical hardware!
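To illustrate the kind of check this boils down to, here is a minimal sketch of scanning a syslog-style file for boot messages that are not preceded by a clean shutdown. The file path and the marker strings are assumptions for the example, not our actual log format.

    # Minimal sketch: flag boots that were not preceded by a clean shutdown.
    # The log path and marker strings below are hypothetical examples.
    BOOT_MARKER = "kernel: Booting"             # assumed boot-sequence line
    SHUTDOWN_MARKER = "systemd: Shutting down"  # assumed shutdown-sequence line

    def find_unclean_reboots(log_path="/var/log/syslog"):
        unclean = []
        clean_shutdown_seen = True  # assume a clean state before the first boot
        with open(log_path, errors="replace") as log:
            for line in log:
                if SHUTDOWN_MARKER in line:
                    clean_shutdown_seen = True
                elif BOOT_MARKER in line:
                    if not clean_shutdown_seen:
                        unclean.append(line.rstrip())  # boot with no prior shutdown
                    clean_shutdown_seen = False  # reset until the next shutdown
        return unclean

    if __name__ == "__main__":
        for entry in find_unclean_reboots():
            print("possible power loss before:", entry)

In our case, a boot sequence with no matching shutdown sequence was enough to tell the story.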

Normally, any part of our service and data store can be rebooted and reconnect with the rest of the system without causing any problems. But since this was a direct power loss, some data was heavily corrupted because the machine died mid-write.

WHAT WE DO NOW

We run our core service on AWS to abstract away hardware management. We are now working closely with the AWS technical team to figure out how this could happen.

Also, what took the most time in both incidents was checking and repairing corrupted data. We have an automated tool for this and have just finished an improved version (today, unfortunately) that is up to 32 times (2 × CPUs) faster on the same dataset. Should we ever need to check for corruption again, it should only take a few minutes instead of hours.
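For those curious where the 2 × CPUs figure comes from, here is a rough sketch of the idea: fan the corruption check out over a pool of worker processes. The check_chunk function and the chunking are hypothetical stand-ins for this example; our real tool works on our own data format.

    import multiprocessing as mp
    import os

    def check_chunk(chunk):
        """Hypothetical stand-in: verify one slice of the data store and
        return the ids of records that fail their integrity check."""
        return [record_id for record_id, ok in chunk if not ok]

    def parallel_corruption_scan(chunks):
        # Two workers per CPU matches the "up to 32x (2 x CPUs)" figure,
        # assuming the check is I/O-bound enough to keep extra workers busy.
        workers = 2 * os.cpu_count()
        with mp.Pool(processes=workers) as pool:
            results = pool.map(check_chunk, chunks)
        return [record_id for part in results for record_id in part]

    if __name__ == "__main__":
        # Toy input: each chunk is a list of (record_id, passes_check) pairs.
        toy_chunks = [[(i, i % 97 != 0) for i in range(start, start + 1000)]
                      for start in range(0, 10000, 1000)]
        print("corrupted records:", parallel_corruption_scan(toy_chunks))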

Based on this incident (and the previous one) we have made a 3-tiered plan to improve the stability of Clerk.io:

  1. In July and August we will make (and have already made) many smaller improvements to our data store and recovery tools. This means less stress on the data store, which increases stability in general, plus faster recovery tools.
  2. In September we will make some larger changes to our data store to increase its stability.
  3. In Q1 2018 we plan to move the entire data store to a fully managed service, provided AWS can guarantee that this lets us avoid "pulling the plug" incidents like the one we just experienced.


If you have any questions, feel free to contact me directly at hkb@clerk.io.

Hans-Kristian

Founder of Clerk.io


Update: 7th August 2017

Amazon has now investigated the incident and come back with the following reply, confirming that a technical issue with their hardware caused the outage:

Hello,

This is Lisa from AWS, thank you for taking the time to reach out to us.

The EC2 Team have investigated and found that the underlying host on which your EC2 instance was running experienced a transient issue that affected the reachability of your instance. The issue was resolved and the host was recovered automatically.

Please note that we make every possible effort to ensure Amazon Web Services is highly available and resilient. We also have automated processes in place to warn us of potential failures, but in some cases the hardware fails before any warning can be triggered, which is what happened here.

We are sorry for any inconvenience caused by this issue. Please do reply back with any further questions or concerns.

Best regards,

Lisa M., Amazon Web Services

Posted Jul 29, 2017 - 15:45 CEST

Resolved
Everything is running smoothly.

We are working throughout the weekend to be ready to deploy our new data storage setup as fast as possible.

A full postmortem will be available tomorrow.
Posted Jul 28, 2017 - 17:54 CEST
Monitoring
We are back online and are monitoring the situation.
Posted Jul 28, 2017 - 17:10 CEST
Update
The datastore check-up is almost complete and we will initiate the final repairs shortly.
Posted Jul 28, 2017 - 16:36 CEST
Identified
We are fixing a data store outage.

We will update this page periodically with our progress.
Posted Jul 28, 2017 - 15:23 CEST