System outage

Incident Report for Clerk.io

Postmortem

On Thursday, February 7th and Sunday, February 10th we experienced full outages of our service for approximately 1 hour each.

WHAT HAPPENED

Both incidents were related to data corruption in our main data store. The main reason for the downtime was that we had to check and repair any corrupted data files before we could enable the service again.

We believe the incidents are related and were caused by faulty hardware underlying our main data store.

WHAT WE DID

As soon as the problems occurred, our team was alerted and quickly identified the problem in the data store. We then followed the recovery process we had prepared and ran the automated tools we had built for such an incident.

Both the processes and tools worked flawlessly.

WHAT WE DO NOW

Although it is impossible to avoid hardware failures, we can do more to limit the impact they have.

It has been clear to us for some time that we have outgrown our current data store architecture. That is why we have been working on a completely new architecture since the beginning of the year. We plan to release the new architecture in March 2019.

The new architecture will make our service a lot more resilient to faulty hardware. In case of a corruption error, only a small subset of stores will be affected instead of the entire service.
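
To illustrate the idea of limiting a failure to a subset of stores, here is a minimal sketch. It is not a description of our actual implementation: the shard count, the `shard_for` and `affected_stores` helpers, the store IDs, and the hash-based placement are all assumptions made purely for illustration.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical number of independent data shards


def shard_for(store_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a store to a shard using a stable hash of its ID."""
    digest = hashlib.sha256(store_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def affected_stores(store_ids, failed_shard, num_shards=NUM_SHARDS):
    """Return only the stores whose data lives on the failed shard."""
    return [s for s in store_ids if shard_for(s, num_shards) == failed_shard]


# Hypothetical store IDs, for illustration only.
stores = [f"store-{i}" for i in range(1000)]
down = affected_stores(stores, failed_shard=3)
print(f"{len(down)} of {len(stores)} stores affected")
```

In this sketch, a corruption error on one shard requires checking and repairing only that shard's data, while stores on the other shards keep running.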

We are deeply sorry for any problems and frustrations this downtime has caused.

If you have any questions, feel free to contact me directly at hkb@clerk.io.

Hans-Kristian

Founder of Clerk.io

Posted Feb 10, 2019 - 20:27 CET

Resolved

After 24 hours of monitoring without any issues, we can confirm that this issue is resolved.

We will update later with a detailed description and action plan.

Posted Feb 08, 2019 - 11:12 CET

Monitoring

We are up again but are monitoring the situation closely.

Posted Feb 07, 2019 - 06:03 CET

Identified

We are experiencing an outage in our data store, where corrupt data has forced us to take the entire system offline.

We are currently running a full scan and repair of all data and expect that to be done in 15-30 minutes.

Posted Feb 07, 2019 - 05:19 CET

This incident affected: API, Data Sync, and Dashboard.