Minor Outage
Incident Report for Clerk.io
Postmortem

Today we experienced what we first thought was a complete outage of all Clerk.io services.

Looking further into the issue, we found that, though it was a significant outage, around half of all requests were still coming through, and all background processes syncing and analyzing data were unaffected.

Here is a timeline of the events:

  • At 08:29 UTC our Redis datastore experienced a failure. We use this service for semi-persistent storage, especially for our email service. This led to most of our API workers waiting for Redis and thus not being available to serve other requests, slowing down our service.
  • At 08:30 our master alarm went off and alerted our OPS team.
  • At 08:34 we had identified our Redis setup as the likely source of the issue.
  • At 08:38 we had confirmed that Redis was in fact the issue and started to isolate its use from the system to quickly get back to an operating state.
  • At 08:43 we had successfully isolated our Redis usage and our service was again fully operational.
  • At 09:02 we had fixed and tested our Redis infrastructure and it was placed back into service, restoring everything back to normal.
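
The failure mode in the timeline (workers blocked on one unresponsive datastore) and the fix (isolating its use) can be illustrated with a small circuit breaker. This is only a sketch under assumptions and not Clerk.io's actual code: the `CircuitBreaker` class, its parameters, and the fallback value are hypothetical, and it assumes the datastore client is configured with a timeout so that a stalled call raises an exception instead of hanging.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical helper: skip a failing dependency for a cool-down period."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before retrying
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """Return True while calls to the dependency should be skipped."""
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker and allow a trial call.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, func: Callable[[], T], fallback: T) -> T:
        """Run func, or return fallback if the dependency is failing."""
        if self.is_open():
            return fallback  # do not touch the dependency at all
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0  # a success resets the failure count
        return result
```

With a wrapper like this, a worker would pass its datastore lookup as `func` and a safe default as `fallback`. After `max_failures` consecutive errors the worker stops calling the datastore until the cool-down has passed, so it stays free to serve requests that do not need it.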

To prevent a similar issue, we will now get a full overview of the technical details that led to the Redis outage and, from those learnings, make our service more robust.

Posted Aug 25, 2019 - 10:32 CEST

Resolved
Everything is now back to normal.
Posted Aug 25, 2019 - 10:08 CEST
Monitoring
We have mitigated the issue and everything is now operational again.

We are monitoring the situation.
Posted Aug 25, 2019 - 09:48 CEST
Identified
We have an issue in our Redis cluster and are trying to mitigate it to make the rest of the service available.
Posted Aug 25, 2019 - 09:46 CEST
Investigating
We are experiencing a full-service outage. We are investigating the issue and will be back shortly.
Posted Aug 25, 2019 - 09:39 CEST
This incident affected: API, Data Sync, and Dashboard.