50ms slower average response time

Incident Report for Clerk.io

Postmortem

These are the technical details for all the nerds that want to know the details of our intense 24 hour debugging adventure.

On Monday 25th at 21:14 UTC we suddenly started to receive high volume (12x normal levels) of data-update requests. This was in itself fine and didn’t affect our performance.

It did though disclose an error in the software layer used to communicate with our caching infrastructure.

Our DevOps team deliberated this issue and though the impact now were minor it could lead to more problems. So we decided that we needed to fix this before Black Friday.

We instantly looked for alternative software and within a few hours we had found a candidate and had integrated in to our core caching library passing all tests with flying colours.

At 00:43 UTC we rolled out our patch and it immediately resolved the software errors we originally set out to solve.

‌

After a few minutes though we started to notice a slow but steady increase in our avg response time.

We deliberated if we should roll the change back but since the increase was stable and we still delivered results for the end user in sub 100ms we decided that this was a better situation compared to having the software bugs we originally fixed.

Over the next 24 hours we did some intense debugging to find the cause of the slowdown. When all components were timed on their own everything was fast. When we ran full request in our developer tools it was fast. But a soon as we put it in a real environment the slowdown occurred.

We tested 9 different theories over the 24 hours but none caused the issue.

‌

Finally we discovered that the slowdown cache-reads immediately following a cache-write to a highly requested key on a persistent network connection. If that specific line of requests occurred the read would be slow.

Normally we run all cache-writes asynchronously so we don’t need to wait for an answer. When a read goes to a highly requested key the cache server has to wait for all reads to complete before performing the write.

The problem occurred when we tried to do a read on the same connection where a write was waiting. All other requests on that connection had to wait for that write.

‌

Luckily the solution was simple. When using persistent connections always have dedicated read connection so no reads have to wait and have a dedicated write connection so stash asynchronously writes on.

‌

That immediately resolved all problems and when rolled out on 21:28 UTC Tuesday 26th all response times returned to our normal 11-12ms average.

Posted Nov 27, 2019 - 10:46 CET

Resolved

Over 24 hours we saw a stable increase in our average response time form 12ms to 62ms.

We still delivered sum 100ms responses to all our main markets so a end user should hardly be able to notice the difference.

The issue was due to a internal networking issue that has now been identified and resolved.

The technical details are available in the Postmortem.

Posted Nov 27, 2019 - 09:23 CET

This incident affected: API.