Slack's 6am Database Club

These are the highlights from an episode of Tern Stories. You can watch the full conversation with Maude Lemaire on YouTube, Spotify, Apple, or wherever you get your podcasts.
Slack didn’t plan this migration in advance.
It started because IBM was breaking the system.
As Slack’s largest customer ramped up, tens of thousands of people began logging in around the same time every day. That surge exposed one of the company’s most sensitive bottlenecks: channel membership.
Every feature in Slack—notifications, search, file access—relies on knowing what channels you’re in. But that logic lived across three different MySQL tables, with raw SQL scattered throughout the codebase. Each login triggered a heavy UNION ALL, executed tens of thousands of times during the highest-traffic window of the day.
It got so bad that Slack engineers were coming in at 6 A.M., just to watch the database and make sure it survived the login spike.
One of the engineers who took that on was Maude Lemaire—two years out of college and newly responsible for one of the most critical migrations in the company.
Where Channel Membership Hurt the Most
Slack’s membership model was split across three tables—one for public channels, one for private channels, and one for DMs. Every login required multiple reads, stitched together into a heavyweight UNION ALL. It worked—until IBM.
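To make that concrete, here is roughly what a three-table lookup like that can look like. This is a sketch, not Slack's actual schema: the table names, column names, and the `db` helper are all assumptions.

```python
# Illustrative only: membership split across three tables, stitched
# together with UNION ALL on every login. Names are assumed, not Slack's.
CHANNELS_FOR_USER_SQL = """
    SELECT channel_id FROM public_channel_members  WHERE user_id = %(user_id)s
    UNION ALL
    SELECT channel_id FROM private_channel_members WHERE user_id = %(user_id)s
    UNION ALL
    SELECT channel_id FROM dm_members              WHERE user_id = %(user_id)s
"""

def channels_for_user(db, user_id):
    # One heavyweight query per login, multiplied across tens of
    # thousands of logins landing in the same window.
    rows = db.query(CHANNELS_FOR_USER_SQL, {"user_id": user_id})
    return [row["channel_id"] for row in rows]
```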
One day, someone archived a channel with 18,000 members. That kicked off 18,000 individual DELETEs—sequentially, in a tight loop—right as everyone was logging in.
“It felt like swallowing a goat. You archive a channel, and Slack tries to delete 18,000 rows in real time—while everything else is happening.”
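A rough sketch of that archive path, purely for illustration (the table and helper names are assumptions): one query to find the members, then one DELETE per member, all on the request path.

```python
# Illustrative anti-pattern: archiving a channel by deleting membership
# rows one at a time, in a tight loop, while logins are hammering the DB.
def archive_channel(db, channel_id):
    members = db.query(
        "SELECT user_id FROM public_channel_members WHERE channel_id = %(c)s",
        {"c": channel_id},
    )
    for row in members:  # 18,000 members means 18,000 sequential DELETEs
        db.execute(
            "DELETE FROM public_channel_members"
            " WHERE channel_id = %(c)s AND user_id = %(u)s",
            {"c": channel_id, "u": row["user_id"]},
        )
```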
The system couldn’t keep up. The solution Maude started working toward was a single channels_members table: one row per person per channel. No joins, no branching logic—just one place to check access, no matter the channel type.
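The target shape, sketched here with assumed names, looks something like this: one table, one index, one query.

```python
# Illustrative target: one row per (channel, user), one lookup, no
# UNION ALL and no per-channel-type branching. The schema is an assumption.
CHANNELS_MEMBERS_DDL = """
    CREATE TABLE channels_members (
        channel_id BIGINT NOT NULL,
        user_id    BIGINT NOT NULL,
        PRIMARY KEY (channel_id, user_id),
        KEY idx_user (user_id)
    )
"""

def channels_for_user(db, user_id):
    rows = db.query(
        "SELECT channel_id FROM channels_members WHERE user_id = %(u)s",
        {"u": user_id},
    )
    return [row["channel_id"] for row in rows]
```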
How a Wrapper Made Migration Possible
Raw SQL wasn’t the problem. Neither was having multiple access paths. In most systems, that’s fine.
But this wasn’t most systems. This was a critical path under load, and the system was breaking in real time. With the same logic scattered across the codebase, it was nearly impossible to debug or change anything safely—especially while it was on fire.
So Maude made a tactical move: she pulled every channel membership query into a single file with a set of shared functions. Not to abstract, just to centralize. That gave the team one place to stage changes, wire up feature flags, and swap implementations on the fly.
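In spirit, the wrapper looked something like the sketch below: one entry point, with a feature flag deciding which storage backs it. The flag helper, function names, and queries are invented for illustration, not pulled from Slack's code.

```python
# Illustrative wrapper: every membership read funnels through one
# function, so a feature flag can route between old and new storage.
def get_channels_for_user(db, flags, user_id):
    if flags.enabled("channels_members_reads", user_id):
        return read_from_channels_members(db, user_id)  # new single table
    return read_from_legacy_tables(db, user_id)         # old UNION ALL path

def read_from_channels_members(db, user_id):
    rows = db.query(
        "SELECT channel_id FROM channels_members WHERE user_id = %(u)s",
        {"u": user_id},
    )
    return [row["channel_id"] for row in rows]

def read_from_legacy_tables(db, user_id):
    rows = db.query(
        """
        SELECT channel_id FROM public_channel_members  WHERE user_id = %(u)s
        UNION ALL
        SELECT channel_id FROM private_channel_members WHERE user_id = %(u)s
        UNION ALL
        SELECT channel_id FROM dm_members              WHERE user_id = %(u)s
        """,
        {"u": user_id},
    )
    return [row["channel_id"] for row in rows]
```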
“We didn’t need perfection. We needed control.”
Which turned out to be critical. At one point, Maude accidentally flipped Slack itself—not IBM—over to the new table too early. Channel access broke across the company. But because everything was routed through a single control point, the fix was clean: one flag, one file, instant rollback. Only one person noticed.
Testing in Production, Forever and Always
There’s a lot of gruntwork in a migration like this—moving writes to the new table, syncing rows, flipping over reads. But what mattered most wasn’t the mechanics. It was what the experience unlocked.
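For the mechanics, the usual shape is a dual-write phase: keep the old tables as the source of truth, mirror writes into the new table behind a flag, backfill the history, then move reads. A minimal sketch of the dual-write step, with assumed names:

```python
# Illustrative dual-write phase: the legacy tables stay the source of
# truth while the new channels_members table is kept in sync. The flag
# name, table names, and db helper are assumptions for the sketch.
def add_member(db, flags, channel_id, user_id, legacy_table):
    # legacy_table is one of the three old tables, chosen by channel type.
    db.execute(
        f"INSERT INTO {legacy_table} (channel_id, user_id) VALUES (%(c)s, %(u)s)",
        {"c": channel_id, "u": user_id},
    )
    if flags.enabled("channels_members_writes", user_id):
        db.execute(
            "INSERT IGNORE INTO channels_members (channel_id, user_id)"
            " VALUES (%(c)s, %(u)s)",
            {"c": channel_id, "u": user_id},
        )
```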
At Slack, Maude and her team rolled out the membership rewrite gradually—starting with a handful of IBM users known to have massive channel memberships, then ramping to 10%, 25%, 50%, and beyond. It was tightly monitored, but it was still production. Staging couldn’t replicate IBM’s scale, and the only way to know if it was working was to try it live.
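A ramp like that is often driven by deterministic bucketing, so the same user stays on the same path as the percentage climbs. This is a generic sketch, not Slack's actual flag system.

```python
import hashlib

# Illustrative percentage rollout: hash the user into a stable bucket so
# the same user stays in the same cohort as the ramp goes 10% -> 25% -> 50%.
def in_rollout(user_id: str, percent: int, salt: str = "channels_members") -> bool:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent

# Raising the percentage only adds users; nobody already in flips back out.
print(in_rollout("user-123", 10), in_rollout("user-123", 25))
```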
That experience—of debugging a system under real-world pressure—shaped how Slack scaled for future customers like Amazon.
When it came time to support even larger orgs, Maude built tooling to simulate login spikes using synthetic users, encouraged teams to add circuit breakers to shut things down at the first sign of instability, and monitored Grafana dashboards in real time to see what broke.
“We wrote this script that pretended to be a bunch of users logging in, just smashing the login endpoint, and then we had a Grafana dashboard with red and green lights. If the system started choking, we’d flip the breaker.”
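Here is a minimal sketch of that kind of synthetic login spike. The endpoint, wave sizes, thresholds, and the script tripping its own breaker are all assumptions for illustration; in the story above, it was humans watching Grafana who made that call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed HTTP client

LOGIN_URL = "https://slack.example.test/api/login"  # hypothetical endpoint
USERS_PER_WAVE = 500     # synthetic logins fired per wave
ERROR_BUDGET = 0.05      # stop if more than 5% of a wave fails


def fake_login(n: int) -> bool:
    """Pretend to be one user smashing the login endpoint."""
    try:
        resp = requests.post(LOGIN_URL, json={"user": f"synthetic-{n}"}, timeout=5)
        return resp.ok
    except requests.RequestException:
        return False


def run_spike(waves: int = 40) -> None:
    with ThreadPoolExecutor(max_workers=50) as pool:
        for wave in range(waves):
            start = wave * USERS_PER_WAVE
            outcomes = list(pool.map(fake_login, range(start, start + USERS_PER_WAVE)))
            error_rate = outcomes.count(False) / len(outcomes)
            print(f"wave {wave}: error rate {error_rate:.1%}")
            if error_rate > ERROR_BUDGET:
                print("error budget blown, stopping the spike")
                break
            time.sleep(1)  # let the dashboards catch up between waves


if __name__ == "__main__":
    run_spike()
```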
It wasn’t just about validating changes. It revealed bugs in fallback paths, subtle MySQL behaviors, and the true cost of recovery logic under pressure. Testing in production wasn’t a fallback—it became the strategy.
And that centralized wrapper for channel membership? It made future work easier too—like migrating the entire table to Vitess. Small decisions, made during a fire, that gave Slack leverage for years.
Want more from Maude?
🐸 Explore her work at maudethecodetoad.com
📖 Check out her book Refactoring at Scale from O’Reilly: Read it here →
🎥 Watch the full episode below: