API Downtime

Jan 14 at 09:58 CET
Affected services
api.clerk.io
Recommendations
my.clerk.io

Resolved
Jan 14 at 09:58 CET

At 09:01 we began sending out a large email newsletter for a customer, using the new high-capacity email infrastructure we have been building to increase email throughput and reliability.

At 09:04 the first emails began to be opened, and traffic from open- and click-tracking started to increase.

At 09:07 the combined traffic to one of our customer database shards had increased almost a hundred-fold over regular levels, and the shard began slowing down as it reached its concurrency limits. This was roughly 50% higher than the highest spike previously observed from email sending.
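As a rough sketch of this failure mode (illustrative only, not our actual database code), a shard with a fixed concurrency limit behaves like a semaphore-capped executor: once all slots are taken, every additional query waits in line, so observed latency grows with the size of the backlog rather than with the cost of the query itself. The limits and timings below are hypothetical.

    import asyncio
    import time

    SHARD_CONCURRENCY_LIMIT = 10   # hypothetical per-shard cap
    QUERY_TIME = 0.05              # hypothetical time one query takes, in seconds

    shard_slots = asyncio.Semaphore(SHARD_CONCURRENCY_LIMIT)

    async def query_shard() -> float:
        """Run one query; waits for a free slot when the shard is saturated."""
        start = time.monotonic()
        async with shard_slots:              # queues here once the cap is reached
            await asyncio.sleep(QUERY_TIME)  # stand-in for the real query
        return time.monotonic() - start      # latency as seen by the caller

    async def main() -> None:
        for load in (10, 100, 1000):         # regular traffic vs. a ~100x spike
            latencies = await asyncio.gather(*(query_shard() for _ in range(load)))
            print(f"{load:>5} concurrent queries -> worst latency {max(latencies):.2f}s")

    asyncio.run(main())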

The slowdown on this database shard meant that the API began slowing down as well, and the automatic scaling systems began adding new API servers to keep up with the demand.

At 09:14 the team was alerted via PagerDuty that the API could not scale fast enough, with most of its workers tied up waiting for the overloaded database shard.
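A back-of-the-envelope sketch of why adding API servers could not keep up (the numbers are hypothetical, not our production figures): by Little's law the number of busy workers equals the request rate multiplied by the time each request is held, so when shard latency jumps a hundred-fold, the number of servers required jumps with it, far faster than any autoscaler can add capacity.

    # Hypothetical figures for illustration only.
    REQUESTS_PER_SECOND = 2_000   # incoming API traffic
    WORKERS_PER_SERVER = 50       # worker processes per API server

    def servers_needed(shard_latency_s: float) -> int:
        """Little's law: busy workers = arrival rate x time each request is held."""
        busy_workers = REQUESTS_PER_SECOND * shard_latency_s
        return -(-int(busy_workers) // WORKERS_PER_SERVER)  # ceiling division

    for latency in (0.05, 0.5, 5.0):  # healthy shard vs. increasingly degraded shard
        print(f"shard latency {latency:>4}s -> {servers_needed(latency):>4} API servers needed")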

At 09:20 the issue had been diagnosed and work started to recover the service.

At 09:23 the database shard and the API service were fully recovered.

The API was intermittently unavailable for a 9-minute window; some requests came through, but most were rejected by the load balancer due to lack of capacity.

Customers on the affected shard experienced intermittent availability for roughly 16 minutes.