On Thursday, February 7th and Sunday, February 10th, we experienced full outages of our service lasting approximately one hour each.
Both incidents were related to data corruption in our main data store. The main reason for the downtime was that we had to check and repair any corrupted data files before we could enable the service again.
We believe the incidents are related and were caused by faulty hardware underlying our main data store.
As soon as the problems occurred, our team was alerted and immediately identified the problem in the data store. We then followed the process and ran the automated tools we had prepared for such an incident.
Both the process and the tools worked flawlessly.
Though it’s impossible to avoid hardware failures, we can do more to limit the impact they have.
It has been clear to us for some time that we have outgrown our current data store architecture. That's why we have been working on a completely new architecture since the beginning of the year. We plan to release the new architecture in March 2019.
The new architecture will make our service much more resilient to faulty hardware. In case of a corruption error, only a small subset of stores will be affected instead of the entire service.
We are deeply sorry for any problems and frustration this downtime has caused.
If you have any questions, feel free to contact me directly at hkb@clerk.io.
Hans-Kristian
Founder of Clerk.io