Search and Recommendations issues.
Resolved
Aug 29 at 13:14 CEST
Postmortem - 2025-08-29
Overview
Earlier today, the search, recommendation, and chat services at api.clerk.io
began malfunctioning, returning an InternalError
with the message: "An internal error occurred. If you need assistance, please contact our support team."
Timeline:
All times CEST.
- 09:29 - A dependency update from our APM (Application Performance Monitoring) provider was staged for testing and deployment. It contained no behavior changes on our part; dependency updates like these are routinely performed without issue.
- 09:31 - The update entered the testing pipeline and underwent unit and integration tests.
- 09:43 - The full testing pipeline was completed with no issues, and a blue/green deployment was started in production. A full set of servers was spun up with the new APM dependency.
- 09:44 - All new instances reported ready and began taking fractional traffic.
- 09:49 - No errors were detected in the new instances, and the old instances began spinning down, increasing traffic to the patched instances.
- 09:50 - An internal Redis server handling data queues showed a rapid—but not critical—rise in memory usage.
- 09:51 - To avoid crashing, the Redis instance began refusing new write commands. Several services that rely on this queue system then malfunctioned in unexpected ways (a sketch of this failure mode follows the timeline).
- Due to the structure of our internal and external monitoring, these issues occurred in a "gap" between automatic monitors and were not reported.
- External monitors use an endpoint that does not rely on visitor or message logging, so they did not report the error.
- Internal monitoring observed Redis closing connections but not out-of-memory conditions, and did not alert.
- ~10:20 - The issue was reported, and internal escalation began.
- 10:34 - The issue was identified as stemming from one of two recent updates, and both were staged for rollback.
- 10:47 - All code changes were rolled back, and normal operation was restored.
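For background on the failure mode seen at 09:51: when a Redis instance with a memory limit and a non-evicting policy reaches that limit, it keeps answering reads but rejects write commands. The sketch below is a minimal illustration under those assumed conditions; the client code, queue name, and error handling are illustrative only and are not our production code.

```python
import redis

# Minimal illustration (not production code): a Redis instance configured
# with a maxmemory limit and the "noeviction" policy keeps serving reads
# but rejects write commands once the limit is reached. Producers that do
# not expect this partial failure can surface it far from the queue itself,
# for example as a generic InternalError in an API response.
r = redis.Redis(host="localhost", port=6379)

def enqueue(queue_name: str, payload: str) -> bool:
    """Push a job onto a queue; return False if Redis refused the write."""
    try:
        r.rpush(queue_name, payload)  # write command: refused under memory pressure
        return True
    except redis.exceptions.ResponseError as exc:
        # Redis rejects writes with "OOM command not allowed when used
        # memory > 'maxmemory'" while reads continue to succeed.
        if "OOM" in str(exc):
            return False
        raise
```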
Remediation:
Several measures have been and will be taken to prevent similar issues in the future:
- The endpoint used for external monitoring will be modified to be more thorough and to "fail" earlier.
- External monitoring will be strengthened with additional requests that actively exercise different features, allowing us to detect issues in specific subsystems (see the sketch after this list).
- Internal policy and training for error reporting and escalation will be updated to ensure faster responses to errors reported outside our established monitoring.
- The issue cannot yet be reproduced in development; we will work with the APM vendor to reproduce and isolate the behavior in a controlled environment and will update this document with a confirmed root cause.
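As an illustration of the strengthened external monitoring, a synthetic check could issue one request per subsystem rather than relying on a single shallow health endpoint, so that a failure in a dependent path (such as logging or queuing) is caught even when the service still responds. The sketch below is hypothetical: the probe URLs, response fields, and failure criteria are placeholders, not our actual monitoring configuration.

```python
import sys
import requests

# Hypothetical synthetic probes: one request per subsystem so that a failure
# in a dependent path is caught even when a shallow health endpoint still
# returns 200. The URLs below are placeholders, not real monitoring endpoints.
PROBES = {
    "search": "https://api.clerk.io/v2/example-search-probe",
    "recommendations": "https://api.clerk.io/v2/example-recommendations-probe",
    "chat": "https://api.clerk.io/v2/example-chat-probe",
}

def run_probes(timeout: float = 5.0) -> list[str]:
    """Return the names of subsystems whose probe failed."""
    failed = []
    for name, url in PROBES.items():
        try:
            resp = requests.get(url, timeout=timeout)
            body = resp.json()
            # Treat application-level errors (an InternalError inside a 200
            # response) as failures, not only transport-level errors.
            if resp.status_code != 200 or body.get("error"):
                failed.append(name)
        except (requests.RequestException, ValueError):
            failed.append(name)
    return failed

if __name__ == "__main__":
    failed = run_probes()
    if failed:
        print("Probe failures:", ", ".join(failed))
        sys.exit(1)
    sys.exit(0)
```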
Updated
Aug 29 at 10:47 CEST
We did a rollback and will investigate further.
Created
Aug 29 at 10:01 CEST
While performing internal dependency updates, we encountered a problem that affected parts of our search and recommendations.
We are very sorry for the inconvenience.