Cache Infrastructure Down

Incident Report for Clerk.io

Postmortem

Earlier today the Clerk.io API was unavailable for a total of 18 minutes due to a failure in our cache infrastructure, which increased the average response time from 9 milliseconds to 17 seconds.

The failure was caused by a memory leak in a monitoring application on the central cache server, which made the server rapidly run out of memory.

The monitoring application has been temporarily disabled and a bug has been reported to the company providing the service. Tonight we will do a maintenance update to prevent this from happening in the future.

Here is what happened (times are in UTC):

07:29 - The monitoring application starts to rapidly leak memory.
07:30 - The server crashes due to lack of memory.
07:31 - Without any cache, the database is overloaded and response times soar through the roof. A major alarm is issued.
07:33 - All processing tasks and other compute-heavy tasks are manually disabled to minimize the DB load.
07:34 - The problem is identified as the cache servers, and a graceful restart is issued (standard procedure to avoid corrupting the system and causing even more downtime).
07:41 - Due to the memory overload, the graceful restart takes more than 5 minutes.
07:43 - The cache servers are powered up and Memcached starts picking up connections.
07:49 - The cache is now hot enough for response times to return to a functioning level.
08:00 - Everything is back to normal and all processing tasks are re-enabled.
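
For readers comparing this timeline with the CEST timestamps on the status updates, the arithmetic can be sketched as below. This is only an illustration: it assumes the 18 minutes of unavailability run from the alarm at 07:31 to the recovery at 07:49, which the report does not state explicitly, and the helper name `utc` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; a fixed offset is enough for a single day.
CEST = timezone(timedelta(hours=2), "CEST")


def utc(hhmm: str) -> datetime:
    """Build a UTC timestamp on the incident date from an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2017, 3, 28, hour, minute, tzinfo=timezone.utc)


# Assumption: the outage window is alarm (07:31) to recovery (07:49).
alarm = utc("07:31")
recovered = utc("07:49")
outage_minutes = int((recovered - alarm).total_seconds() // 60)

# The first status update was posted at 09:38 CEST; convert it to UTC
# so it can be placed on the timeline above.
first_notice_utc = datetime(2017, 3, 28, 9, 38, tzinfo=CEST).astimezone(timezone.utc)

print(outage_minutes)
print(first_notice_utc.strftime("%H:%M"))
```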

What we will do to prevent this in the future

A single cache server should never be able to take down the entire service. We will look into both adding more spare memory to our servers and using cache mirroring services for improved stability.
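
To illustrate the idea behind cache mirroring, here is a minimal sketch in Python. It is not our production setup: `InMemoryNode`, `MirroredCache` and `get_or_load` are hypothetical names, and the nodes are plain in-memory dictionaries standing in for real cache servers. The point is only that when every entry lives on more than one node, losing one node does not send all reads to the database.

```python
from typing import Callable, Dict, List, Optional


class CacheNodeDown(Exception):
    """Raised by a node that cannot serve requests."""


class InMemoryNode:
    """Hypothetical stand-in for one cache server."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.alive:
            raise CacheNodeDown(self.name)
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.alive:
            raise CacheNodeDown(self.name)
        self._data[key] = value


class MirroredCache:
    """Writes every entry to all nodes; reads from the first node that answers."""

    def __init__(self, nodes: List[InMemoryNode]) -> None:
        self.nodes = nodes

    def set(self, key: str, value: str) -> None:
        for node in self.nodes:
            try:
                node.set(key, value)
            except CacheNodeDown:
                continue  # a dead mirror must not fail the write

    def get(self, key: str) -> Optional[str]:
        for node in self.nodes:
            try:
                value = node.get(key)
            except CacheNodeDown:
                continue  # try the next mirror
            if value is not None:
                return value
        return None

    def get_or_load(self, key: str, loader: Callable[[str], str]) -> str:
        """Return the cached value, calling the loader (the database) only on a miss."""
        value = self.get(key)
        if value is None:
            value = loader(key)
            self.set(key, value)
        return value
```

In this sketch, if one of two nodes is marked as down after a value has been cached, `get_or_load` still returns the value from the surviving mirror and the loader is not called again. The trade-off is that every write goes to all nodes, so memory use grows with the number of mirrors.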

We know that our service is essential to your business and we deeply apologize for any downtime! We are also thankful for your support during this incident even though we let you down.

I wish you all the very best

Hans-Kristian Bjerregaard

Founder, Clerk.io

Posted Mar 28, 2017 - 14:37 CEST

Resolved

Everything is running smoothly again.

We will look into why our cache infrastructure completely crashed and update with a full post-mortem later today when everything has been analyzed.

Posted Mar 28, 2017 - 10:00 CEST

Monitoring

The cache system is back online and we are back. The system is still warming up but we are getting close to normal response times again.

Posted Mar 28, 2017 - 09:53 CEST

Investigating

The Clerk.io API is really, really slow due to an error on our cache servers.

Posted Mar 28, 2017 - 09:38 CEST