DB Connectivity Issues
Incident Report for Clerk.io
Postmortem

Last Friday, March 28th, we had a period of instability followed by a complete outage, which together lasted about 45 minutes in the late afternoon (ca. 15:30 - 16:15).

We know that our services are an essential part of your business and that any downtime is unacceptable. For that I am personally truly sorry.


What happened is that we simply outgrew our infrastructure.

Over the past weeks we have had a slow but steady increase in bugs and performance issues, culminating in this outage. The incidents all looked completely independent, but it is now obvious that they were all caused by us hitting the limits of our infrastructure. It was not until database tables started to crash that it became clear that this was not a software problem but an infrastructure problem.

After realising this, we immediately halted all non-essential processing jobs in order to keep the core service uninterrupted. We then chose to shut everything down for 10 minutes while we repaired the damaged database tables. Those were 10 very, very long minutes, but it was the fastest way to get everything operational again.
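The idea of halting non-essential jobs to protect the core service can be sketched as below. This is only an illustration under stated assumptions: the names Job, JobScheduler and protected_mode are hypothetical and do not describe how our jobs are actually scheduled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """A unit of background work (hypothetical model, for illustration only)."""
    name: str
    essential: bool  # True if the core service depends on this job


class JobScheduler:
    """Decides which jobs may run; in protected mode only essential jobs pass."""

    def __init__(self):
        self.protected_mode = False

    def runnable(self, jobs):
        # Normal operation: every job is allowed to run.
        if not self.protected_mode:
            return list(jobs)
        # Protected mode: shed everything the core service does not need.
        return [job for job in jobs if job.essential]


jobs = [
    Job("serve-recommendations", essential=True),
    Job("bulk-data-import", essential=False),
    Job("weekly-report", essential=False),
]

scheduler = JobScheduler()
scheduler.protected_mode = True
allowed = [job.name for job in scheduler.runnable(jobs)]
```

With protected_mode switched on, only the job marked essential remains in allowed; switching it off again lets every job run.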

We immediately started working on a new infrastructure, which was fully implemented yesterday afternoon. We have both increased the raw computing power and capacity of all servers and made some structural rearrangements to improve stability and prevent future outages.

A core change is that the subsystem delivering real-time results, such as recommendations or search, has been completely isolated from the rest of our system. One immediate result is that the processing time of real-time results has gone down from 40ms to only 15ms on average. This also makes it a lot easier to prevent these incidents in the future, since every subsystem now runs on its own separate hardware.

Again, I am truly sorry for the frustration this outage has caused.

Hans-Kristian Bjerregaard
CEO & Founder

Posted Mar 30, 2014 - 17:26 CEST

Resolved
Everything is now running again. We will continue with a limit to the import for the rest of the weekend.
Posted Mar 28, 2014 - 16:42 CET
Monitoring
We are back again in protected mode and monitoring the situation. This means that all services work but data import rates have been limited.
Posted Mar 28, 2014 - 16:14 CET
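For readers wondering what a limited import rate can look like in practice, a token bucket is one common way to cap throughput. The sketch below is an assumption for illustration, not a description of the limiter we actually run; the class name TokenBucket and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allows at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens
        self.clock = clock        # injectable clock, which makes testing easier
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if an operation of the given cost may proceed now."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: allow roughly 2 import batches per second, with a burst of 2.
import_limiter = TokenBucket(rate=2, capacity=2)
first_batch_allowed = import_limiter.allow()
```

An import worker would call allow() before each batch and wait or re-queue the batch whenever it returns False, so the database never sees more than the configured rate.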
Update
A few accounts are still having problems. They are being fixed now.
Posted Mar 28, 2014 - 15:50 CET
Identified
Our DB is not accepting all connections at the moment.
Posted Mar 28, 2014 - 15:33 CET