System outage
Incident Report for Clerk.io
Postmortem

On Thursday, February 7th and Sunday, February 10th we experienced two full outages of our service, each lasting approximately one hour.

WHAT HAPPENED

Both incidents were related to data corruption in our main data store. The main reason for the downtime was that we had to check and repair any corrupted data files before we could enable the service again.

We believe the two incidents are related and were caused by faulty hardware underlying our main data store.

WHAT WE DID

As soon as the problems occurred, our team was alerted and quickly identified the problem in the data store. We then followed our prepared recovery process and ran the automated tools we had built for such an incident.

Both the processes and tools worked flawlessly.

WHAT WE DO NOW

Though it’s impossible to avoid hardware failures, we can do more to reduce the impact they have.

It has been clear to us for some time that we have outgrown our current data store architecture. That's why we have been working on a completely new architecture since the beginning of the year. We plan to release the new architecture in March 2019.

The new architecture will make our service much more resilient to faulty hardware. In case of a corruption error, only a small subset of stores will be affected instead of the entire service.

We are deeply sorry for any problems and frustrations this downtime has caused.

If you have any questions, feel free to contact me directly at hkb@clerk.io

Hans-Kristian

Founder of Clerk.io

Posted 6 months ago. Feb 10, 2019 - 20:27 CET

Resolved
After 24 hours of monitoring without any issues, we can confirm that this issue is resolved.

We will update later with a detailed description and action plan.
Posted 7 months ago. Feb 08, 2019 - 11:12 CET
Monitoring
We are up again but are monitoring the situation closely.
Posted 7 months ago. Feb 07, 2019 - 06:03 CET
Identified
We experienced an outage in our data store, where corrupt data forced us to take the entire system offline.

We are currently running a full scan and repair of all data and expect that to be done in 15-30 minutes.
Posted 7 months ago. Feb 07, 2019 - 05:19 CET
This incident affected: API, Data Sync, and Dashboard.