Search and Recommendations issues.
Resolved
Aug 29 at 13:14 CEST
Postmortem - 2025-08-29
Overview
Earlier today, the search, recommendation, and chat services at api.clerk.io
began malfunctioning, returning an InternalError
with the message: "An internal error occurred. If you need assistance, please contact our support team."
Timeline:
All times CEST.
- 09:29 - A dependency update from our APM (Application Performance Monitoring) provider was staged for testing and deployment. It contained no behavior changes on our part; dependency updates like these are routinely performed without issue.
- 09:31 - The update entered the testing pipeline and underwent unit and integration tests.
- 09:43 - The full testing pipeline was completed with no issues, and a blue/green deployment was started in production. A full set of servers was spun up with the new APM dependency.
- 09:44 - All new instances reported ready and began taking fractional traffic.
- 09:49 - No errors were detected in the new instances, and the old instances began spinning down, increasing traffic to the patched instances.
- 09:50 - An internal Redis server handling data queues showed a rapid—but not critical—rise in memory usage.
- 09:51 - To avoid crashing, the Redis instance began refusing new write commands. Several services that rely on this queue system then malfunctioned in unexpected ways (a sketch of this failure mode follows the timeline).
- Due to the structure of our internal and external monitoring, these issues occurred in a "gap" between automatic monitors and were not reported.
- External monitors use an endpoint that does not rely on visitor or message logging, so they did not report the error.
- Internal monitoring observed Redis closing connections but not out-of-memory conditions, and did not alert.
- ~10:20 - The issue was reported, and internal escalation began.
- 10:34 - The issue was identified as stemming from one of two recent updates, and both were staged for rollback.
- 10:47 - All code changes were rolled back, and normal operation was restored.
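For background on the failure mode seen at 09:51: when a Redis instance with a memory limit and a non-evicting policy reaches that limit, it keeps answering reads but rejects write commands. The sketch below is a minimal illustration under those assumed conditions; the client code, queue name, and error handling are illustrative only and are not our production code.

```python
import redis

# Minimal illustration (not production code): a Redis instance configured
# with a maxmemory limit and the "noeviction" policy keeps serving reads
# but rejects write commands once the limit is reached. Producers that do
# not expect this partial failure can surface it far from the queue itself,
# for example as a generic InternalError in an API response.
r = redis.Redis(host="localhost", port=6379)

def enqueue(queue_name: str, payload: str) -> bool:
    """Push a job onto a queue; return False if Redis refused the write."""
    try:
        r.rpush(queue_name, payload)  # write command: refused under memory pressure
        return True
    except redis.exceptions.ResponseError as exc:
        # Redis rejects writes with "OOM command not allowed when used
        # memory > 'maxmemory'" while reads continue to succeed.
        if "OOM" in str(exc):
            return False
        raise
```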
Remediation:
Several measures have been and will be taken to prevent similar issues in the future:
- The endpoint used for external monitoring will be modified to be more thorough and to "fail" earlier.
- External monitoring will be strengthened with additional requests that actively exercise different features, allowing us to detect issues in specific subsystems (see the sketch after this list).
- Internal policy and training for error reporting and escalation will be updated to ensure faster responses to errors reported outside our established monitoring.
- The issue cannot yet be reproduced in development; we will work with the APM vendor to reproduce and isolate the behavior in a controlled environment and will update this document with a confirmed root cause.
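As an illustration of the strengthened external monitoring, a synthetic check could issue one request per subsystem rather than relying on a single shallow health endpoint, so that a failure in a dependent path (such as logging or queuing) is caught even when the service still responds. The sketch below is hypothetical: the probe URLs, response fields, and failure criteria are placeholders, not our actual monitoring configuration.

```python
import sys
import requests

# Hypothetical synthetic probes: one request per subsystem so that a failure
# in a dependent path is caught even when a shallow health endpoint still
# returns 200. The URLs below are placeholders, not real monitoring endpoints.
PROBES = {
    "search": "https://api.clerk.io/v2/example-search-probe",
    "recommendations": "https://api.clerk.io/v2/example-recommendations-probe",
    "chat": "https://api.clerk.io/v2/example-chat-probe",
}

def run_probes(timeout: float = 5.0) -> list[str]:
    """Return the names of subsystems whose probe failed."""
    failed = []
    for name, url in PROBES.items():
        try:
            resp = requests.get(url, timeout=timeout)
            body = resp.json()
            # Treat application-level errors (an InternalError inside a 200
            # response) as failures, not only transport-level errors.
            if resp.status_code != 200 or body.get("error"):
                failed.append(name)
        except (requests.RequestException, ValueError):
            failed.append(name)
    return failed

if __name__ == "__main__":
    failed = run_probes()
    if failed:
        print("Probe failures:", ", ".join(failed))
        sys.exit(1)
    sys.exit(0)
```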
Updated
Aug 29 at 10:47 CEST
We did a rollback and will investigate further.
Created
Aug 29 at 10:01 CEST
While performing internal dependency updates, we encountered a problem that affected parts of our search and recommendations.
We are very sorry for the inconvenience.