Degraded

Search and Recommendations issues.

Aug 29 at 10:01 CEST
Affected services
api.clerk.io
Search
Recommendations

Resolved
Aug 29 at 13:14 CEST

Postmortem - 2025-08-29

Overview

On 2025-08-29, the search, recommendation, and chat services at api.clerk.io began malfunctioning and returned an InternalError with the message: "An internal error occurred. If you need assistance, please contact our support team."

Timeline:

All times CEST.

  • 09:29 - A dependency update from our Application Performance Monitoring (APM) provider was staged for testing and deployment. It contained no behavior changes on our part; dependency updates like these are routinely performed without issue.
  • 09:31 - The update entered the testing pipeline and underwent unit and integration tests.
  • 09:43 - The full testing pipeline was completed with no issues, and a blue/green deployment was started in production. A full set of servers was spun up with the new APM dependency.
  • 09:44 - All new instances reported ready and began taking fractional traffic.
  • 09:49 - No errors were detected in the new instances, and the old instances began spinning down, increasing traffic to the patched instances.
  • 09:50 - An internal Redis server handling data queues showed a rapid, but not yet critical, rise in memory usage.
  • 09:51 - To avoid crashing, the Redis instance began refusing new write commands. As a result, several services that rely on this queue system malfunctioned in unexpected ways.
    • Due to the structure of our internal and external monitoring, these issues occurred in a "gap" between automatic monitors and were not reported.
    • External monitors use an endpoint that does not rely on visitor or message logging, so they did not report the error.
    • Internal monitoring observed Redis closing connections but not out-of-memory conditions, and did not alert.
  • ~10:20 - The issue was reported, and internal escalation began.
  • 10:34 - The issue was traced to one of two updates, and both were staged for rollback.
  • 10:47 - All code changes were rolled back, and normal operation was restored.
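
The monitoring gap described at 09:51 (connection closures were observed, but memory exhaustion was not) can be illustrated with a small check on the memory section of Redis INFO. This is only a sketch under stated assumptions: the function name, the status labels, and the 80% warning threshold are hypothetical, and the assumption that the instance ran with a memory limit and the noeviction policy is ours, not something the timeline confirms. The field names used_memory, maxmemory, and maxmemory_policy are standard INFO fields.

```python
from typing import Mapping


def redis_memory_status(info: Mapping[str, object], warn_ratio: float = 0.8) -> str:
    """Classify memory pressure from the 'memory' section of Redis INFO.

    `info` is a plain mapping, for example the dict a Redis client returns
    for INFO memory. The labels and `warn_ratio` are illustrative only.
    """
    used = int(info.get("used_memory", 0))
    limit = int(info.get("maxmemory", 0))
    policy = str(info.get("maxmemory_policy", "noeviction"))

    # maxmemory == 0 means no configured limit, so no ratio can be computed.
    if limit <= 0:
        return "unbounded"

    ratio = used / limit
    # At the limit with noeviction, Redis rejects commands that would
    # use more memory instead of evicting keys.
    if ratio >= 1.0 and policy == "noeviction":
        return "rejecting-writes"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


# Hypothetical sample: usage close to the limit should already raise a warning.
sample = {"used_memory": 900, "maxmemory": 1000, "maxmemory_policy": "noeviction"}
print(redis_memory_status(sample))
```

A check of this kind would alert on the rise in memory usage itself rather than on its side effects, such as closed connections.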

Remediation:

Several measures have been and will be taken to prevent similar issues in the future:

  • The endpoint used for external monitoring will be modified to be more thorough and to "fail" earlier.
  • External monitoring will be strengthened with additional requests that actively exercise different features, allowing us to detect issues in specific subsystems.
  • Internal policy and training for error reporting and escalation will be updated to ensure faster responses to errors reported outside our established monitoring.
  • The issue cannot yet be reproduced in development; we will work with the APM vendor to reproduce and isolate the behavior in a controlled environment and will update this document with a confirmed root cause.
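
The first two measures above could take roughly the following shape: a monitoring endpoint that runs one probe per subsystem and fails as soon as any of them fails. This is a minimal sketch, not our actual implementation; the probe names, the simulated queue failure, and the result format are all hypothetical.

```python
from typing import Callable, Dict


def deep_health_check(probes: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run every probe and record 'ok' or the failure message per subsystem."""
    results: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            probe()  # a probe signals failure by raising an exception
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"fail: {exc}"
    return results


def is_healthy(results: Dict[str, str]) -> bool:
    """The check passes only if every subsystem probe succeeded."""
    return all(status == "ok" for status in results.values())


# Hypothetical probes. Real ones would issue a search request, a
# recommendation request, and a small write to the logging queue.
def search_probe() -> None:
    pass  # stands in for a successful search request


def queue_write_probe() -> None:
    # Simulates the queue refusing writes, as happened during the incident.
    raise RuntimeError("queue refused write")


results = deep_health_check({"search": search_probe, "queue_write": queue_write_probe})
print(results)
print("healthy:", is_healthy(results))
```

Because the queue write is part of the check, a monitor built this way would have failed at 09:51 instead of continuing to report success.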

Updated
Aug 29 at 10:47 CEST

We have rolled back the changes and will investigate further.

Created
Aug 29 at 10:01 CEST

While updating internal dependencies, we encountered a problem that affected parts of our search and recommendations.
We are very sorry for the inconvenience.