Resolved
Sep 11 at 16:07 CEST

Post Mortem - 2025-09-11 - Intermittent API failures

All times are CEST.

Overview

Earlier today, the logging, search, and recommendation API endpoints began failing intermittently, returning errors or no results.

We’re sorry for the disruption this caused to you and your customers.

Impact

Affected services: logging, search (v2), and recommendation (v2) APIs
Symptoms: intermittent error responses and missing results
Windows: 08:52–09:25 CEST and ~12:20 (brief)

Timeline

08:19 - A slow increase in incoming live updates (and CRUD requests) begins. The increase is masked by the usual morning rise in requests.
08:52 - Our inter-service messaging and event system reaches its memory limit, causing the first failed requests. As messages are consumed, capacity is freed up, leading to the intermittent nature of the issue.
09:02 - The issue is confirmed in monitoring and SRE begins investigating.
09:10 - Engineering confirms the issue in the messaging and event system and begins adding resources to increase event-processing capacity.
09:25 - Issue is resolved and intermittent failures cease.
12:20 - Another severe spike in incoming live updates causes a few more minutes of intermittent request failures.

Root Cause Analysis

The root cause was a lack of available resources in the messaging and event system.
A transient increase in load from live updates caused it to hit a hardware limit and stop accepting new events.

Parts of the API rely on this system for session, message, user-activity, and usage logging. With the system refusing new messages and events, these parts of the API failed.
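
To illustrate why the failures were intermittent rather than constant, here is a minimal sketch in Python. It is not our actual implementation: the bounded `queue.Queue`, its capacity of 3, and the `handle_request` and `consume` helpers are hypothetical stand-ins for the messaging and event system and the API code that hands events to it.

```python
import queue

# Hypothetical stand-in for the messaging and event system:
# a buffer with a hard capacity limit (3 is an arbitrary value).
events = queue.Queue(maxsize=3)

def handle_request(event):
    """Pre-fix behaviour: the request fails if the event cannot be handed over."""
    try:
        events.put_nowait(event)
    except queue.Full:
        return "error"
    return "ok"

def consume(n):
    """Consumers draining up to n events free capacity for new requests."""
    for _ in range(n):
        try:
            events.get_nowait()
        except queue.Empty:
            break

results = [handle_request(i) for i in range(5)]       # burst larger than capacity
consume(2)                                            # consumers catch up a little
results += [handle_request(i) for i in range(5, 8)]   # some requests succeed again
print(results)
```

Requests succeed until the buffer is full, fail while it stays full, and succeed again as soon as consumers free some capacity, which matches the on-and-off pattern seen during the incident.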

Remediation

Several avenues of remediation are being pursued; most have already been implemented:
1. Completed: The parts of the API that hand events over to the messaging and event system have been modified to 'degrade gracefully'. If our infrastructure has issues, you will still get search results and recommendations back from us; only usage tracking and other logging events are disabled.
2. Completed: Increased resources for the messaging and event system. We have increased the system's resources eightfold (8x), allowing it to absorb much larger transient loads.
3. Planned for tonight: Separation of specific event types. We will move some event types to a separate, more resilient system, isolating them from large spikes in live updates and making them more durable against hardware errors.
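
The 'degrade gracefully' change in item 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the `track` and `search` functions, the response shape, and the buffer capacity of 2 are all hypothetical.

```python
import queue

# Hypothetical bounded event buffer standing in for the messaging and event system.
events = queue.Queue(maxsize=2)
dropped_events = 0

def track(event):
    """Best-effort hand-over: returns False instead of raising when the buffer is full."""
    global dropped_events
    try:
        events.put_nowait(event)
        return True
    except queue.Full:
        dropped_events += 1  # count the drop so it stays visible in monitoring
        return False

def search(query):
    """Hypothetical search handler: results never depend on tracking succeeding."""
    results = [f"result for {query}"]  # placeholder for the real lookup
    tracked = track({"type": "search", "query": query})
    return {"results": results, "tracked": tracked}

responses = [search(q) for q in ["a", "b", "c"]]
print(responses, dropped_events)
```

In this sketch the third request still returns results even though its tracking event is dropped. Counting the dropped events is a design choice made for the example, so that lost tracking data does not go unnoticed.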

We sincerely apologize for the inconvenience this has caused, not just to you but also to your customers, and want to reassure you that we are continuing to look into ways to further strengthen our systems.

Updated
Sep 11 at 09:29 CEST

The issue has been resolved.

We are continuing close monitoring and working on a root cause analysis.

Complete post mortem will follow.

Created
Sep 11 at 09:03 CEST

We are currently experiencing intermittent failures of search and recommendation requests.

Engineering is working to restore regular service.