Degraded performance
Incident Report for Clerk.io
Postmortem

Earlier today we experienced degraded performance on our core API. We managed to resolve the issue with minor service impact.

The core issue

At 13:20 UTC our database servers saw incoming queries growing rapidly. Within 30 minutes traffic had grown to 10x the normal load. Our reliability team responded early, but identifying the root cause was not straightforward because the traffic showed no clear pattern.

Following our standard procedure, we disabled all non-essential subsystems to direct all resources to our core functionality.

This meant that all syncs and non-essential features were disabled.

By 14:18 UTC we had identified a failed cache instance that was returning corrupted data. Because the instance still responded, it did not show up as problematic in our monitoring; but since the data it returned was invalid, all requests fell through to the databases.
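One way to catch this class of failure is to have health checks verify response content, not just liveness. Below is a minimal sketch, assuming a hypothetical checksum-per-entry scheme (this is an illustration, not our actual cache protocol):

```python
import hashlib
import json
from typing import Optional


def make_entry(payload: dict) -> dict:
    """Store a cache entry together with a checksum of its payload."""
    body = json.dumps(payload, sort_keys=True)
    return {"body": body, "checksum": hashlib.sha256(body.encode()).hexdigest()}


def is_healthy(entry: Optional[dict]) -> bool:
    """Healthy only if the instance responds AND its data verifies.

    A liveness-only check would have marked the failed instance as fine,
    since it still answered requests -- just with corrupted data.
    """
    if entry is None:  # no response: the classic liveness failure
        return False
    digest = hashlib.sha256(entry["body"].encode()).hexdigest()
    return digest == entry["checksum"]  # corrupted body fails verification


good = make_entry({"product_id": 42})
bad = dict(good, body=good["body"][:-1])  # simulate corruption in transit

print(is_healthy(good))  # True
print(is_healthy(bad))   # False: responds, but data is wrong
print(is_healthy(None))  # False: does not respond at all
```

The middle case is the one this incident exposed: the instance passes a ping-style check but fails a content check.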

Once identified, the issue was resolved by 14:24 UTC.

All subsystems were reactivated and running by 14:34.

Side-effects on product filters

The biggest performance impact came from several product filters accidentally being reset to match 0 products, which left recommendations that use filters showing no products at all.

This was a side-effect of our reliability team disabling filters to save resources on the database servers. The disabled filters ended up matching no products instead of matching all products.
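This failure mode is a common one when filter terms are cleared. A minimal illustration with hypothetical predicate-list filters (not our actual filter engine), showing how an emptied filter can flip from "no restriction" to "match nothing":

```python
from typing import Callable, Dict, List

Product = Dict[str, str]
Predicate = Callable[[Product], bool]


def matches_any(product: Product, terms: List[Predicate]) -> bool:
    """Match if the product satisfies at least one term.

    any([]) is False, so clearing the term list silently makes the
    filter match NOTHING -- the behavior seen during the incident.
    """
    return any(t(product) for t in terms)


def matches_disabled_ok(product: Product, terms: List[Predicate]) -> bool:
    """Treat an empty term list as 'filter disabled': match everything."""
    return not terms or any(t(product) for t in terms)


shoe = {"category": "shoes"}
in_shoes: List[Predicate] = [lambda p: p["category"] == "shoes"]

print(matches_any(shoe, in_shoes))    # True: normal filtering works
print(matches_any(shoe, []))          # False: emptied filter hides everything
print(matches_disabled_ok(shoe, []))  # True: disabled filter passes all
```

The fix is to make "filter disabled" an explicit state rather than representing it as an empty term list.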

We identified this issue at 14:43 UTC and had it resolved by 14:51 UTC.

What we learned

Though our reliability team managed to keep most of the service running uninterrupted for end consumers, we still had several issues that could have been handled better.

Every time we have an issue, we add it to our playbook so we can handle it next time without problems.

Here is what we will do differently from now on:

  1. We have added more metrics to our cache infrastructure so that failures like this one can be traced to the correct root cause earlier.
  2. We will implement a standard emergency mode for running our service, which in this case would have let us resolve the issue without any interruption.
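The first item above can be made concrete. As a sketch (the counter names and thresholds are illustrative assumptions, not our actual monitoring setup), tracking the ratio of requests that fall through the cache to the databases would surface an instance that responds but serves unusable data:

```python
class CacheMetrics:
    """Hypothetical counters for one cache instance.

    An alert on the fall-through ratio flags a cache that keeps
    responding but forces every request onto the databases.
    """

    def __init__(self) -> None:
        self.requests = 0
        self.db_fallthroughs = 0

    def record(self, served_from_cache: bool) -> None:
        self.requests += 1
        if not served_from_cache:
            self.db_fallthroughs += 1

    def fallthrough_ratio(self) -> float:
        return self.db_fallthroughs / self.requests if self.requests else 0.0


m = CacheMetrics()
for _ in range(90):
    m.record(served_from_cache=False)  # corrupted cache: everything misses
for _ in range(10):
    m.record(served_from_cache=True)

print(m.fallthrough_ratio())  # 0.9 -- far above a healthy baseline
```

A liveness check sees a responding instance; a ratio like this one makes the degradation visible within minutes rather than an hour.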
Sep 09, 2019 - 17:05 CEST

Resolved
We have identified the issue and should be fully operational within 10 minutes.
Sep 09, 2019 - 15:31 CEST
Monitoring
Due to a spike in incoming traffic we are running at degraded performance.

All core services are operational but data updates and statistics have been slowed down or disabled completely until we are fully on top of the situation.
Sep 09, 2019 - 15:06 CEST
This incident affected: API and Data Sync.