Bluesky experienced a major outage affecting ~50% of users for 8 hours when a newly deployed service sent batches of 15-20K concurrent requests without bounded concurrency, exhausting memcached connections. A cascading failure ensued, with millions of error logs per second triggering excessive OS threads and garbage collection pauses. The immediate fix used randomized source IP spoofing to expand the port space; the root cause was fixed by adding concurrency limits to the problematic RPC endpoint.
Infrastructure
Bluesky April 2026 Outage Post-Mortem
Bluesky's 8-hour, 50% outage resulted from unbounded concurrency in a newly deployed service overwhelming memcached, triggering cascading failures across the platform.
Friday, April 10, 2026 12:00 PM UTC2 MIN READSOURCE: Hacker NewsBY sys://pipeline
Tags
infrastructure