Earlier today we experienced degraded performance on our core API. We resolved the issue with only minor service impact.
The core issue
At 13:20 UTC our database servers saw a rapid increase in incoming queries; within 30 minutes traffic had grown to 10x normal load. Our reliability team responded early, but identifying the root cause was not straightforward because the traffic showed no clear pattern.
Following our standard procedure, we disabled all non-essential subsystems, including all syncs and non-essential features, to direct resources to our core functionality.
By 14:18 UTC we had identified a failed cache instance that was returning corrupted data. Because the instance still responded, it did not show up as problematic on our monitoring; but since its data was invalid, every request fell through to the databases.
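A liveness check alone would not have caught this: the instance responded, but with bad data. Below is a minimal sketch (hypothetical function names, with a plain dict standing in for the cache client) of one way to validate cached payloads with a checksum so that corruption is treated as a cache miss rather than served to callers:

```python
import hashlib
import json

def write_with_checksum(cache, key, value):
    """Store a value alongside a checksum of its serialized form."""
    payload = json.dumps(value)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    cache[key] = {"payload": payload, "checksum": digest}

def read_validated(cache, key):
    """Return the cached value only if its checksum still matches.

    A corrupted entry is treated as a miss, so the request falls
    back to the database instead of serving bad data.
    """
    entry = cache.get(key)
    if entry is None:
        return None  # ordinary cache miss
    digest = hashlib.sha256(entry["payload"].encode()).hexdigest()
    if digest != entry["checksum"]:
        return None  # corrupted entry: treat as a miss
    return json.loads(entry["payload"])

# Simulate the failure mode: a responsive cache holding corrupted data.
cache = {}
write_with_checksum(cache, "product:42", {"name": "widget", "price": 9.99})
cache["product:42"]["payload"] = "garbage"  # corruption after write
assert read_validated(cache, "product:42") is None  # treated as a miss
```

A monitoring probe built on the same validated read path would have flagged the instance as unhealthy even though it was still answering requests.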
Once the cause was identified, we resolved the issue by 14:24 UTC.
All subsystems were reactivated and running by 14:34 UTC.
Side-effects on product filters
The largest user-facing impact came from several product filters accidentally being reset to match zero products, which caused recommendations relying on those filters to show no products at all.
This was a side effect of our reliability team disabling filters to save resources on the database servers: instead of falling back to matching every product, the disabled filters matched none.
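The distinction here is fail-open versus fail-closed. A minimal sketch (hypothetical names; our actual filter system differs) of a filter path that fails open, falling back to the full catalogue when filtering is disabled:

```python
def apply_filters(products, filters, enabled=True):
    """Apply product filters; when disabled, fail open.

    When filtering is turned off (e.g. during an incident), returning
    the unfiltered product list keeps recommendations populated.
    Returning an empty match set instead is the failure mode that
    emptied our recommendations.
    """
    if not enabled:
        return list(products)  # fail open: match everything
    result = products
    for predicate in filters:
        result = [p for p in result if predicate(p)]
    return result

products = [{"id": 1, "in_stock": True}, {"id": 2, "in_stock": False}]
in_stock = lambda p: p["in_stock"]

assert apply_filters(products, [in_stock]) == [products[0]]
# Disabled filters fall back to the full catalogue, not zero products.
assert apply_filters(products, [in_stock], enabled=False) == products
```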
We identified this issue at 14:43 UTC and had it resolved by 14:51 UTC.
What we learned
Though our reliability team kept most of the service running uninterrupted for end consumers, several issues could have been handled better.
Whenever we encounter an issue, we add it to our playbook so we can handle it smoothly next time.
Here is what we will do differently from now on: