Today we experienced what we first thought was a complete outage of all Clerk.io services.
A closer look at the issue revealed that, though it was a significant outage, around half of all requests were still getting through, and all background processes syncing and analyzing data were unaffected.
Here is a timeline of the events:
- At 08:29 UTC our Redis datastore experienced a failure. We use this service for semi-persistent storage, especially for our email service. This led to most of our API workers blocking while waiting for Redis, leaving them unable to serve other requests and slowing down our entire API (a mitigation sketch follows the timeline).
- At 08:30 our master alarm went off and alerted our OPS team.
- At 08:34 we had identified our Redis setup as the likely source of the issue.
- At 08:38 we had confirmed that Redis was in fact the issue and started isolating its use from the rest of the system to quickly get back to an operating state (see the second sketch after the timeline).
- At 08:43 we had successfully isolated our Redis usage and our service was again fully operational.
- At 09:02 we had fixed and tested our Redis infrastructure, and it was placed back into service, restoring everything to normal.
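The core failure mode at 08:29 was workers blocking indefinitely on Redis calls. One common mitigation is to give the Redis client aggressive connect and read timeouts, so a sick Redis node makes individual calls fail fast instead of tying up workers. This is a minimal sketch only; we are not describing Clerk.io's actual stack, and the hostname and timeout values below (using the Python `redis` package) are illustrative assumptions:

```python
import redis

# Hypothetical client configuration: short timeouts make a slow or dead
# Redis node raise an error quickly instead of blocking a worker forever.
client = redis.Redis(
    host="redis.internal",       # placeholder hostname
    port=6379,
    socket_connect_timeout=0.5,  # seconds to wait when opening a connection
    socket_timeout=0.5,          # seconds to wait for any single command
)
```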
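Isolating Redis from the request path, as we did between 08:38 and 08:43, generally means wrapping every Redis call so a failure degrades one feature rather than the whole request. Here is a hedged sketch of such a guard; the `redis_guard` helper and its fallback behavior are illustrative, not our production code:

```python
import logging

import redis

log = logging.getLogger(__name__)

def redis_guard(call, fallback=None):
    """Run a Redis operation without letting Redis trouble fail the request.

    On connection or timeout errors the fallback value is returned, so the
    request keeps being served (with a degraded feature) while Redis is
    being repaired.
    """
    try:
        return call()
    except (redis.ConnectionError, redis.TimeoutError) as exc:
        log.warning("Redis unavailable, serving fallback: %s", exc)
        return fallback

# Illustrative usage: a cache read that degrades to a miss on failure.
# value = redis_guard(lambda: client.get("some-key"), fallback=None)
```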
To prevent a similar issue, we will now build a full overview of the technical details that led to the Redis outage and, from those learnings, make our service more robust.