Making the Donuts: Stabilizing memcached at Slack's Scale

These are the highlights from an episode of Tern Stories. You can watch the full conversation with Glen Sanford on YouTube, Spotify, Apple, or wherever you get your podcasts.
When COVID-19 lockdowns hit in early 2020, Slack was suddenly one of three apps that kept the world together. Usage skyrocketed.
Glen Sanford had joined Slack in 2018. At the time, nobody owned memcached. It worked fine, meaning that outages were rare.
And it was true: memcached outages were rare. When they happened, though, they were spectacular. Losing the wrong memcached host meant cache misses surged, hammering Slack’s core database, Vitess. Enough misses, and even Vitess would collapse entirely.
The worst part? There was no clear way to fix it.
Over the years, Slack engineers had layered quick fixes on top of quick fixes: multiple mcrouter versions, inconsistent configs, and a fragile leader election system nobody wanted to touch. Even looking at the system wrong seemed to break it.
Management: “Do you want to own memcached?”
Glen: “Absolutely not.”
But with pandemic traffic surging and a traffic-encryption deadline looming, he agreed to audit it.
Making the donuts
Glen’s first contribution was simple, and extremely manual.
Slack’s memcached used consistent hashing to distribute keys across hundreds of hosts. When a new host joined the ring, it started cold: every existing host ceded a slice of its keys to the newcomer, and requests for those now-missing keys fell back to the database. If you added hosts too quickly, you risked overloading the database entirely.
“If you do it too fast, then the database melts down. If you do it too slow, it just takes a million years.”
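To make that mechanism concrete, here is a minimal Python sketch of consistent hashing with virtual nodes. The host names, key pattern, and replica count are illustrative, not Slack's actual configuration; the point is how a small slice of keys goes cold every time a host joins.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a point on the ring (0 .. 2**32 - 1)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class Ring:
    """Consistent-hash ring with virtual nodes per host."""

    def __init__(self, hosts, vnodes=100):
        self._points = []  # sorted list of (ring position, host)
        for host in hosts:
            for i in range(vnodes):
                self._points.append((_hash(f"{host}#{i}"), host))
        self._points.sort()

    def host_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next host point."""
        idx = bisect.bisect(self._points, (_hash(key),))
        return self._points[idx % len(self._points)][1]

if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(100_000)]
    before = Ring([f"mc-{n}" for n in range(10)])
    after = Ring([f"mc-{n}" for n in range(11)])  # one new, cold host

    moved = sum(before.host_for(k) != after.host_for(k) for k in keys)
    # Roughly 1/11 of keys now map to the cold host; every one of them
    # is a guaranteed cache miss that falls through to the database.
    print(f"{moved / len(keys):.1%} of keys moved to a cold host")
```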
The system Glen inherited was too fragile to automate outright: the existing deployment mechanisms were brittle, and small missteps could cascade into outages. His first change wasn't full automation, but a script that added just enough structure to make manual operations safer. It let him expand the ring one host at a time, with manual gating and a metrics review after every change.
This process was lovingly dubbed “making the donuts”. The script didn’t remove the need for human judgment; it just made the job manageable. Every time there was a security patch, scaling need, or change to the memcached hosts, Glen pressed that button, watched the charts, and waited 90 seconds — hundreds of times.
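In pseudocode, the workflow had roughly the shape below. This is a sketch of the process, not Slack's actual tooling: add_host_to_ring, cache_miss_rate, and the threshold are hypothetical stand-ins for whatever the real script and dashboards provided.

```python
import time

def add_host_to_ring(host: str) -> None:
    """Placeholder: the real script called Slack's deploy tooling here."""

def cache_miss_rate() -> float:
    """Placeholder: the real script read this from the metrics system."""
    return 0.0

MISS_RATE_CEILING = 0.05   # illustrative threshold, not a real Slack number
SOAK_SECONDS = 90          # "watched the charts, and waited 90 seconds"

def expand_ring(new_hosts: list[str]) -> None:
    """Add hosts one at a time, with a human gate after each one."""
    for host in new_hosts:
        add_host_to_ring(host)
        time.sleep(SOAK_SECONDS)   # let the new host start taking traffic

        rate = cache_miss_rate()
        print(f"{host}: cache miss rate now {rate:.1%}")
        if rate > MISS_RATE_CEILING:
            print("Miss rate too high; stopping so the database survives.")
            return

        # The judgment the script never removed: a person looks at the
        # dashboards and decides whether to keep turning the crank.
        if input("Continue to next host? [y/N] ").strip().lower() != "y":
            return
```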
It was a place to start!
Encryption adds pressure
Slack’s enterprise customers required end-to-end encryption — even between services inside Slack’s private network. That included memcached.
The team used a system called Nebula, which ran as a sidecar process on each host. It handled encryption and decryption transparently between services, but it had a fixed threading model — two threads, one for encryption and one for decryption.
But memcached wasn’t like other services. Each Slack API request might trigger dozens or hundreds of cache lookups. Encryption overhead added CPU load that quickly saturated the two threads Nebula used to handle traffic.
Once Nebula saturated, a host could push no more than 200% CPU through it — two full cores, no matter how many the machine had.
Glen tuned the architecture: splitting memcached, encryption, and decryption workloads across 16-core machines. He measured throughput, staggered ring expansion, and justified a 3× hardware increase — all to keep the system from falling over.
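As a sketch of the core-allocation idea (the split below is invented for illustration; the real numbers came from measuring throughput), on Linux you can carve a box into disjoint CPU sets and pin each workload to its own cores:

```python
import os

# Give memcached a dedicated set of cores and keep the encryption
# sidecar off of them, so neither workload starves the other.
# This 10/6 split on a 16-core box is purely illustrative.
CORE_PLAN = {
    "memcached": set(range(0, 10)),   # cores 0-9 for cache traffic
    "nebula":    set(range(10, 16)),  # cores 10-15 for encrypt/decrypt
}

def pin(pid: int, role: str) -> None:
    """Pin the given process id to the cores reserved for its role.

    Linux-only: os.sched_setaffinity wraps the same syscall that
    `taskset` uses.
    """
    os.sched_setaffinity(pid, CORE_PLAN[role])
    print(f"pid {pid} ({role}) -> cores {sorted(CORE_PLAN[role])}")

# Usage (the PIDs are placeholders for the real memcached and sidecar processes):
# pin(memcached_pid, "memcached")
# pin(nebula_pid, "nebula")
```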
Still, encryption couldn’t be fully rolled out. Something else was hiding beneath the surface.
Honeycomb and the mystery of the crashes
Despite all the precautions, something kept going wrong.
Occasionally, when a host was added or removed, the entire cache layer would falter. What should have been minor disruptions became major outages. No obvious metrics explained why.
Memcached is uniquely fast: the median request takes 1ms. Slack’s existing monitoring only offered per-minute granularity. In the dashboards, every failure looked like an unexplainable spike, not a trend.
To dig deeper, Glen’s team instrumented memcached with Honeycomb.
“The new data exposed a hidden danger.”
Unlike traditional monitoring, Honeycomb allowed tracing at millisecond resolution and tagging events with rich context: key patterns, clients, traffic sources.
Certain cache keys were accessed far more than others — sometimes millions of times per second. Often these were tied to large customers or popular Slack bots. Because consistent hashing always assigned the same key to the same host, those so-called “hot keys” created bottlenecks.
“If a hot key’s host went offline or became overloaded, the fallout wasn’t localized. It rippled across the entire system.”
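Here is a toy version of the question that per-request events make answerable: which keys exceed some request-rate threshold? The event shape and threshold below are assumptions, not Honeycomb's API.

```python
from collections import Counter
from typing import Iterable

# Illustrative cutoff; the source only says some keys saw
# "millions of times per second".
HOT_THRESHOLD_PER_SEC = 10_000

def hot_keys(events: Iterable[dict], window_seconds: float) -> list[tuple[str, float]]:
    """Return (key, requests_per_second) for keys above the hot threshold."""
    counts = Counter(e["key"] for e in events)
    rates = {k: c / window_seconds for k, c in counts.items()}
    return sorted(
        ((k, r) for k, r in rates.items() if r >= HOT_THRESHOLD_PER_SEC),
        key=lambda kv: kv[1],
        reverse=True,
    )

# Example: events collected over a 60-second window.
# for key, rps in hot_keys(events, window_seconds=60):
#     print(f"{key}: {rps:,.0f} req/s, always landing on the same host")
```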
Fixing the hot keys
Armed with visibility, Glen moved high-traffic keys into a local cache layer that lived directly on the application servers. Instead of routing every request through the central memcached fleet, these keys could be served instantly from memory on the same machine that made the request. That eliminated the worst bottlenecks.
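A minimal sketch of that pattern, not Slack's implementation: an in-process, short-TTL cache for known-hot keys that only reaches out to the memcached fleet when its local copy is missing or stale. The `remote` client, the hot-key list, and the TTL are all assumptions.

```python
import time

class LocalHotKeyCache:
    """In-process cache for known-hot keys, in front of the memcached fleet."""

    def __init__(self, remote, hot_keys, ttl_seconds=5.0):
        self.remote = remote          # whatever memcached client the app already uses
        self.hot_keys = set(hot_keys) # keys identified as hot from the event data
        self.ttl = ttl_seconds
        self._local = {}              # key -> (value, expires_at)

    def get(self, key):
        # Keys that aren't hot go straight to the central fleet, as before.
        if key not in self.hot_keys:
            return self.remote.get(key)

        hit = self._local.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]  # served from memory on the requesting host

        # Miss or stale: refresh from the fleet and keep a short-lived copy
        # locally so the next burst of requests never leaves the box.
        value = self.remote.get(key)
        self._local[key] = (value, time.monotonic() + self.ttl)
        return value
```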
At the same time, Slack’s security team shipped a new version of Nebula that supported multi-core encryption. Combined with Glen’s core allocation strategy, memcached could finally handle its traffic without falling apart.
“We were able to get 93% of our original throughput — but encrypted. That was the unlock.”
From fragility to resilience
With observability and encryption in place, Glen and the team pushed the memcached infrastructure further:
- They built a config layer called McRib, making it easy to update and deploy mcrouter settings.
- They regionalized the fleet, creating availability-zone-local cache rings for better fault isolation.
- And they automated what used to require donuts — including full hardware rotations.
“Once it’s a crank you can turn, it doesn’t make sense for people to turn the crank any more than it makes sense for people to power turbines.”
We get into all this and more in the full episode below. And if that’s not enough, give Glen a shout at glen.nu!