API Slowdown
Incident Report for Clerk.io
Postmortem

Follow-up on today's stability issues

Overview & Cause

Today, between 13:30 and 14:00, we had an issue with very long response times and timeouts interrupting the stability of the service.

This was caused by internal routing errors in the datacenter hosting our main database, which increased the latency of internal messages from less than 1 ms to 250-300 ms.

Since the issue was a slowdown and the affected servers were still running and answering, our normal emergency mode did not kick in. That system would normally let the service run uninterrupted.
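For illustration only, this is a minimal sketch (with hypothetical names, not our actual code) of a liveness-only health check. It only asks "did the server answer?", so a server that answers slowly still counts as healthy and no failover is triggered:

    import socket

    def is_alive(host, port, timeout=5.0):
        # Liveness-only check: succeeds as long as the host accepts a
        # connection, no matter how slowly it responds afterwards.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # During the incident the database host still accepted connections,
    # so a check along these lines would keep reporting "healthy" even
    # while every internal message took 250-300 ms instead of under 1 ms.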

What we have done to prevent this in the future

As soon as the issue was resolved, we started working on improvements to our setup to prevent this from happening again. We have already rolled out a series of improvements and will continue to do so based on what we have learned.

Most central is the ability to automatically detect network bottlenecks.
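Roughly speaking, and again only as an illustrative sketch under assumed names and thresholds, bottleneck detection means measuring round-trip times rather than just reachability, and flagging a host whose latency stays far above its normal baseline:

    import socket
    import time

    LATENCY_THRESHOLD = 0.050  # 50 ms; 250-300 ms is clearly a bottleneck

    def round_trip_time(host, port, timeout=5.0):
        # Time how long it takes to open a connection to the host.
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start

    def is_bottlenecked(host, port, samples=5):
        # Take several samples; if even the fastest one is above the
        # threshold, the path to this host is congested, not just noisy.
        times = [round_trip_time(host, port) for _ in range(samples)]
        return min(times) > LATENCY_THRESHOLD

A monitor running a check like this periodically can trigger the same emergency handling that today only reacts to servers going down completely.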


At Clerk.io we know we provide services that are critical to your business and your relationship with your customers.

We are truly sorry for any inconvenience this has caused and thank you for your understanding.

If you have any questions please reach out directly to me at hkb@clerk.io.


Yours sincerely

Hans-Kristian
Founder & CEO

Posted Jun 03, 2015 - 23:59 CEST

Resolved
The routing issue has been resolved.

We will follow up with a postmortem later.
Posted Jun 03, 2015 - 14:05 CEST
Identified
We are currently trying to redirect traffic around the bottleneck to free up resources while the technicians are working.
Posted Jun 03, 2015 - 13:53 CEST
Update
It has been identified as a routing problem internal to the datacenter.

Our hosting provider is on the issue.
Posted Jun 03, 2015 - 13:41 CEST
Investigating
We are experiencing slowdowns on part of our network.

We are on the case and will be back with updates in a few minutes.
Posted Jun 03, 2015 - 13:35 CEST