“We Should Be Able to Drain an AZ”: Slack’s Cellular Architecture

These are the highlights from an episode of Tern Stories. You can watch the full conversation with Cooper Bethea on YouTube, Spotify, Apple, or wherever you get your podcasts.
Slack didn’t go down—but it came close, twice in a single day.
A misconfiguration in AWS’s Transit Gateway, a relatively new networking component at the time, caused connectivity issues isolated to a single availability zone. TCP retransmits spiked. Retries cascaded across services. Sitewide error rates hovered around 1%.
It wasn’t catastrophic. But it was disruptive.
Slack’s infrastructure teams knew the cause. They could see the problem localized to a single AZ. And they knew what they wanted to do: shift traffic out of that zone.
But they didn’t.
“We hadn’t done it before. There was no playbook. If it failed, we’d be worse off than we started.”
Then AWS fixed the issue—only to inadvertently reintroduce it later the same day. That sealed it.
Cooper Bethea, a senior staff engineer on Slack’s traffic team, started writing. He had seen similar failure modes before at Google, where draining traffic from a failing datacenter was a routine operation. At Slack, it was still theoretical.
His doc was titled:
“We Should Be Able to Drain AZs.”
It wasn’t about reacting better in emergencies. It was about building the infrastructure, process, and defaults to make draining safe—so the team could act early, not late.
Slack Wasn’t Google—And That Was an Advantage
At Google, services are loosely coupled and independently deployable. That structure supports fine-grained control during infrastructure shifts—like draining traffic from a single zone or data center.
Slack was different. It was one product, with one tightly integrated codebase, and most traffic entered through a shared ingress path. That complexity could have been a liability—but it also gave the team a clear place to start.
“We weren’t thinking about per-service draining at first. We just wanted to clear everything out of a zone.”
Slack’s load balancing setup already routed requests through AZ-local reverse proxies, which in turn only forwarded traffic to AZ-local app servers. Those proxies were grouped behind a broader load balancer that wasn’t AZ-aware.
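None of Slack’s actual configuration appears in the episode, but the shape described here is easy to picture in Envoy-style terms: a cluster whose endpoints are declared per availability zone, which is what makes zone-level routing (and zone-level draining) possible later. The names below are illustrative only.

```yaml
# Hypothetical sketch (not Slack's config): an Envoy cluster whose endpoints
# are grouped by availability zone via localities.
clusters:
  - name: app_servers
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: app_servers
      endpoints:
        - locality: { zone: use1-az1 }
          lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: app.az1.internal, port_value: 8080 }
        - locality: { zone: use1-az2 }
          lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: app.az2.internal, port_value: 8080 }
```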
Cooper’s team took advantage of that structure.
The first experiment was deliberately scrappy: SSH into a subset of reverse proxies in the affected zone and knock over their health checks. As those instances dropped out of the load balancer pool, traffic gradually drained from the zone.
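The episode doesn’t say exactly how the health checks were knocked over, or what software the proxies were running. Assuming Envoy-style proxies with the admin interface enabled, the experiment could be as simple as a loop like the one below, with the host names purely illustrative.

```bash
# Hypothetical sketch of the scrappy first drain: force each targeted proxy's
# health check endpoint to start failing, so the load balancer in front of it
# ejects the instance and traffic shifts out of the zone.
for host in proxy-az1-01 proxy-az1-02 proxy-az1-03; do
  ssh "$host" 'curl -s -X POST http://localhost:9901/healthcheck/fail'
done

# To undo the drain later:
#   ssh "$host" 'curl -s -X POST http://localhost:9901/healthcheck/ok'
```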
It worked—until it didn’t.
The Panic Routing Problem
The team quickly ran into a built-in feature of Envoy: panic routing.
When more than roughly a third of a cluster’s backends appear unhealthy, Envoy assumes the health checks themselves are misreporting and stops trusting them. Instead of routing only to healthy proxies, it sprays requests across every backend in the pool, healthy or not, AZ-local or not.
What started as a targeted drain turned into cross-AZ traffic and noisy, misleading metrics. The team had discovered something important: the problem wasn’t just how to drain traffic—but how the system itself would resist being drained.
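In Envoy, this behavior is governed by a per-cluster panic threshold, which teams can tune or disable once they know they will be draining large fractions of a fleet on purpose. The episode doesn’t say how Slack ultimately handled it; the sketch below only shows where the knob lives.

```yaml
# Hypothetical sketch: adjusting Envoy's panic threshold on the cluster that
# fronts the AZ-local proxies. Setting it to 0 disables panic routing, so a
# deliberate drain is never "corrected" by spraying traffic everywhere.
clusters:
  - name: az_local_proxies
    common_lb_config:
      healthy_panic_threshold:
        value: 0.0
```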
From Global Drains to Service-Level Resilience
What began as broad, zone-wide draining naturally evolved into more detailed, service-by-service refinement.
The goal wasn’t a perfect system—it was resilience. And early experiments made it clear that while some services drained cleanly, others didn’t. The next step wasn’t to pause or re-architect, but to keep iterating, guided by what they observed in production.
Cooper Bethea helped turn that iteration into a habit. Each week, engineers from across Slack’s infrastructure teams would meet to run a zonal drain together—not as a drill, but as an opportunity to see what still needed work.
It was hands-on, data-driven discovery.
Each run surfaced new complexities:
- Some services didn’t shift traffic as expected
- Systems like Vitess, designed for regional resilience, needed new orchestration to support zonal failover
- Memcache assumed global state, and relied on consistency patterns that didn’t map cleanly to per-AZ deployments
- RPCs crossed zones due to older service discovery logic that hadn’t been updated for cell-aware routing
These weren’t subtle issues—they showed up clearly in dashboards: bytes per AZ, cross-zone traffic patterns, service-level error rates.
“We’d look at a graph and say, ‘This service isn’t draining.’ Then we’d go fix it.”
To stay focused, Cooper used a simple prioritization model: effort vs. value.
- Quick wins came first—stateless services, discovery fixes, visibility improvements
- Heavier lifts, like Vitess orchestration or Memcache isolation, followed once the foundation was in place
Drain practice wasn’t just about readiness. It became the way Slack discovered, prioritized, and delivered a more resilient system—one service at a time.
And with each iteration, the system got closer to a bigger goal: making the right thing the default.
Make It the Default
The final step wasn’t just enabling drains—it was shaping how the system behaved by default.
By the end of the project:
- Slack could drain an entire AZ in under five minutes
- Traffic defaulted to AZ-local routing
- Cross-AZ traffic became an explicit exception, not an invisible fallback
- New infrastructure followed these rules automatically
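The episode doesn’t spell out the mechanism behind the AZ-local default, but Envoy offers one way to express it: zone-aware routing, which keeps each request in the caller’s own zone and only spills across zones when the local pool gets too small or unhealthy. A sketch, building on locality-labeled endpoints like the earlier one:

```yaml
# Hypothetical sketch: making same-zone routing the default for a cluster.
# Assumes the proxy knows its own zone (via its node locality and a configured
# local cluster) and that upstream endpoints carry locality labels.
clusters:
  - name: app_servers
    common_lb_config:
      zone_aware_lb_config:
        routing_enabled:
          value: 100.0       # always try zone-local routing first
        min_cluster_size: 6  # below this many upstreams, allow cross-zone spill
```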
Draining an AZ was no longer an emergency workaround. It became a normal part of operations.
This wasn’t a top-down re-architecture. It was iterative, cross-functional, and grounded in real traffic. Cooper Bethea helped make it happen by setting a clear goal, building the habit of practice, and guiding the work toward defaults that stuck.
🎥 Watch the full episode of Tern Stories below for all the gory details—from panic routing to Memcache partitioning, and everything in between.
You can also find more from Cooper on LinkedIn.