Yesterday, we at Clerk.io experienced our second-biggest incident, and the second this month, with an outage lasting from 14:34 to 17:09 Central European Time (CET). I can only say that it's dreadful to sit here for the second time in a month writing a postmortem like this. In this postmortem, I will go over what happened, what we did, and how we will prevent this from happening again.
At 14:34 CET our team received a general alert, meaning that the core service had stopped responding to incoming requests.
We immediately started to investigate the source, but due to the holiday season and a late lunch break, it took more than 10 minutes before any technician was near a computer that could access our service (this will be important later).
The cause was identified as corrupt data in our data store and we initially assumed that this was the same error as earlier this month (spoiler: it wasn't).
Since we had just experienced a similar incident (or so we thought), we followed the procedure from the last incident:
But when we reached step 4, we started noticing something: some data points were completely damaged and needed to be restored from a backup. This was not like last time.
But at first, we just focused on getting the service up again as fast as possible.
After getting the service back online yesterday, we immediately started to investigate what caused the issue. The incident from last time should not be able to happen again, since we took immediate precautions after it!
After some deep digging in our server logs, we found something "interesting". The system log file indicated that the master data store had been rebooted! Or rather we noticed that the log files contained log messages for the boot sequence but none for the shutdown sequence.
This is the clear signature of someone pulling the plug on the physical hardware!
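To illustrate the kind of check described above, here is a minimal sketch that scans a log for boot messages that are not preceded by a shutdown message. The marker strings and the function name are hypothetical; real system logs use different wording, and this is not the tooling we actually used.

```python
from typing import Iterable, List

# Hypothetical marker strings (assumptions for this sketch);
# real system logs word their boot and shutdown messages differently.
BOOT_MARKER = "boot sequence started"
SHUTDOWN_MARKER = "shutdown sequence started"


def find_unclean_boots(lines: Iterable[str]) -> List[int]:
    """Return 1-based line numbers of boots with no shutdown logged before them.

    A boot on the very first line of the log is not flagged, because there
    is no earlier log content that could have contained a shutdown.
    """
    unclean = []
    seen_any = False        # has any line been read before this one?
    clean_shutdown = False  # shutdown seen since the previous boot?
    for number, line in enumerate(lines, start=1):
        if SHUTDOWN_MARKER in line:
            clean_shutdown = True
        elif BOOT_MARKER in line:
            if seen_any and not clean_shutdown:
                unclean.append(number)
            clean_shutdown = False
        seen_any = True
    return unclean
```

A boot that shows up in the middle of normal activity, with no shutdown in between, is the signature of a sudden power loss.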
Normally, any part of our service and data store can be rebooted and reconnect with the rest of the service without causing any problems. But since this was a direct power loss, some data was heavily corrupted as the machine died mid-write.
We run our core service on AWS to abstract away from managing hardware. We are now working closely with the AWS technical team to figure out how this could happen.
Also, what took the most time in both incidents was checking and repairing corrupted data. We have an automated tool for this and have just finished an improved version (today, unfortunately) that is up to 32 times (2*CPUs) faster on the same dataset. Should we ever need to check for corruption again, this should take only a few minutes instead of hours.
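As a rough illustration of how a corruption check can be spread over 2*CPUs workers, here is a minimal sketch. It assumes each record carries a stored SHA-256 checksum; the record layout and function names are hypothetical, and our actual tool may work quite differently.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Optional, Tuple

# Hypothetical record layout: (key, payload bytes, expected SHA-256 hex digest)
Record = Tuple[str, bytes, str]


def is_intact(record: Record) -> Tuple[str, bool]:
    """Recompute the checksum of one record and compare it to the stored one."""
    key, payload, expected = record
    return key, hashlib.sha256(payload).hexdigest() == expected


def find_corrupt(records: Iterable[Record], workers: Optional[int] = None) -> List[str]:
    """Check records in parallel and return the keys whose checksums mismatch."""
    if workers is None:
        # Two workers per CPU, mirroring the 2*CPUs figure mentioned above.
        workers = 2 * (os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input records.
        return [key for key, ok in pool.map(is_intact, records) if not ok]
```

Threads are used here to keep the sketch simple; a check dominated by CPU work rather than disk reads might be better served by separate processes.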
Based on this incident (and the previous one) we have made a 3-tiered plan to improve the stability of Clerk.io:
1. In July and August we will make (and already have made) many smaller improvements to our data store and recovery tools. This means less stress on the data store, which increases stability in general, and faster recovery tools.
2. In September we will make some larger changes to our data store to increase its stability.
3. In Q1 2018 we plan to move the entire data store to a fully managed service, provided that AWS can guarantee that this lets us avoid "pulling the plug" incidents like the one we just experienced.
If you have any questions feel free to contact me directly at firstname.lastname@example.org
Founder of Clerk.io
Amazon has now investigated the incident and come back with this reply, confirming that a technical issue in their hardware was the cause:
This is Lisa from AWS, thank you for taking the time to reach out to us.
The EC2 Team have investigated and found that the underlying host on which your ec2 instance was running experienced a transient issue that affected reach ability of your instance. The issue was resolved and host was recovered automatically.
Please note that we make every possible effort to ensure Amazon Web Services is highly available and resilient, We also have automated processes in place to warn us of potential failures but in some cases the hardware fails before any warning can be triggered which is what happened here.
We are sorry for any inconvenience caused by this issue, please do reply back with any further questions or concerns.
Lisa M. Amazon Web Services