Minor Outage

Incident Report for Clerk.io

Postmortem

Today we experienced what we first thought was a complete outage of all Clerk.io services.

Looking further into the issue, we found that although it was a significant outage, around half of all requests were still coming through, and all background processes that sync and analyze data were unaffected.

Here is a timeline of the events:

At 08:29 UTC our Redis datastore experienced a failure. We use this service for semi-persistent storage, especially for our email service. This led to most of our API workers waiting for Redis and thus not being available to serve other requests, which slowed down our service.
At 08:30 our master alarm went off and alerted our OPS team.
At 08:34 we had identified our Redis setup as the likely source of the issue.
At 08:38 we had confirmed that Redis was in fact the issue and started to isolate its use from the system to quickly get back to an operating state.
At 08:43 we had successfully isolated our Redis usage and our service was again fully operational.
At 09:02 we had fixed and tested our Redis infrastructure, and it was placed back into service, restoring everything to normal.
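
The failure mode in the timeline above, where API workers wait on an unresponsive datastore and cannot serve other requests, is commonly limited with a circuit breaker that stops calling the dependency after repeated failures. The following is a minimal sketch of that general technique in Python, not a description of our actual implementation. The names `CircuitBreaker` and `fetch_from_store` and the thresholds are hypothetical and chosen only for illustration.

```python
import time


class CircuitBreaker:
    """Skips calls to a failing dependency until a cool-down period has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, fallback):
        # While open, return the fallback at once instead of waiting on the dependency.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback
            # Cool-down elapsed: let one trial call through.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            # A real service would likely catch only timeout and connection errors.
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0
        return result


def fetch_from_store():
    # Hypothetical stand-in for a datastore read that times out.
    raise TimeoutError("datastore did not respond")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(3):
    # The third iteration returns the fallback without calling fetch_from_store.
    value = breaker.call(fetch_from_store, fallback=None)
```

A breaker like this only helps if each call to the datastore also has a short timeout, so that a single failed call returns quickly instead of holding a worker.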

To prevent a similar issue, we will now get a full overview of the technical details that led to the Redis outage and use what we learn to make our service more robust.

Posted Aug 25, 2019 - 10:32 CEST

Resolved

Everything is now back to normal.

Posted Aug 25, 2019 - 10:08 CEST

Monitoring

We have mitigated the issue and everything is now operational again.

We are monitoring the situation.

Posted Aug 25, 2019 - 09:48 CEST

Identified

We have an issue in our Redis cluster and are trying to mitigate it to make the rest of the service available.

Posted Aug 25, 2019 - 09:46 CEST

Investigating

We are experiencing a full-service outage. We are investigating the issue and will be back shortly.

Posted Aug 25, 2019 - 09:39 CEST

This incident affected: API, Data Sync, and Dashboard.