Full Outage

Incident Report for Clerk.io

Postmortem

On Thursday, February 7th, and Sunday, February 10th, we experienced full outages of our service, each lasting approximately one hour.

WHAT HAPPENED

Both incidents were related to data corruption in our main data store. The main reason for the downtime was that we had to check and repair any corrupted data files before we could enable the service again.

We believe the two incidents are related and were caused by faulty hardware underlying our main data store.

WHAT WE DID

As soon as the problems occurred, our team was alerted and quickly identified the problem in the data store. We then followed the process we had prepared and ran the automated tools we had built for this kind of incident.

Both the process and the tools worked flawlessly.

WHAT WE DO NOW

Although it is impossible to avoid hardware failures entirely, we can do more to limit the impact they have.

It has been clear to us for some time that we have outgrown our current data store architecture. That's why we have been working on a completely new architecture since the beginning of the year. We plan to release the new architecture in March 2019.

The new architecture will make our service far more resilient to faulty hardware. In case of a corruption error, only a small subset of stores will be affected instead of the entire service.

We are deeply sorry for any problems and frustration this downtime has caused.

If you have any questions, feel free to contact me directly at hkb@clerk.io.

Hans-Kristian

Founder of Clerk.io

Posted Feb 10, 2019 - 20:26 CET

Resolved

Everything is operational again. We will update later today with the full status of this incident.

Posted Feb 10, 2019 - 11:10 CET

Update

The last checks are taking a bit longer than expected. We still expect to be done and online within 15 minutes.

Posted Feb 10, 2019 - 10:53 CET

Update

Just a quick update: we are running the last tests now and expect to be fully live within 10 minutes.

Posted Feb 10, 2019 - 10:30 CET

Update

We have traced the issue back to a failed server in our data store setup. This, unfortunately, has left some data damaged. We are currently running a full automated scan and repair and expect this to be resolved within 30-45 minutes.

Posted Feb 10, 2019 - 10:00 CET

Identified

We are experiencing a full outage caused by a hardware failure. More updates will follow.

Posted Feb 10, 2019 - 09:53 CET

This incident affected: API, Data Sync, and Dashboard.