
They Migrated Every Slack Message, Right Before COVID Traffic Tripled

TR Jordan

These are the highlights from an episode of Tern Stories. You can watch the full conversation with Madeline Wu Shortt on YouTube, Spotify, Apple, or wherever you get your podcasts.

Four engineers, one pet shard, zero customer incidents.

The Crisis (and the spoiler)

By late 2019, Slack’s biggest customer lived entirely on a single MySQL shard nicknamed “542.” Capacity models said it would start dropping writes in 12-14 months if nothing changed. It was not the story Slack wanted to tell IBM and its 300k employees.

Madeline Wu Shortt, an engineer on the messaging team, volunteered to lead the fix. Ten months later, in April 2020, days after COVID sent global traffic through the roof, the cut-over finished early with no downtime and no user tickets.

Here’s how.

The Ticking Clock

One truth of every startup: the architecture that gets you to product-market fit will try to kill you at scale.

Slack’s original design sharded by team. Easy, if your biggest customer has 100 people. But IBM had 300,000 employees, and they would all live on shard 542. Every message, every thread, every reaction. One shard.

IBM hadn’t fully rolled out Slack yet, and the infrastructure team couldn’t run ALTER TABLE anymore. Too much data, too much risk. One schema migration could lock the table for hours.

By late 2019, they’d run the projections. Based on IBM’s message growth rate, shard 542 would start dropping writes in 12-14 months. The sales team was pushing IBM to go “wall to wall,” bringing every employee onto Slack. Previous Vitess migrations had taken 18 months with skeleton crews. The messages team knew they needed something different.
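The article doesn’t show Slack’s actual capacity model, but the shape of that projection is simple compound growth against a write ceiling. A back-of-envelope sketch (every number here is a hypothetical illustration, not Slack’s data):

```python
import math

def months_until_write_wall(current_qps: float, max_qps: float,
                            monthly_growth: float) -> float:
    """Months until write load exceeds capacity, assuming compound
    monthly growth. All inputs are hypothetical illustrations."""
    return math.log(max_qps / current_qps) / math.log(1 + monthly_growth)

# e.g. 650 QPS today, a hypothetical 1,300 QPS ceiling, 5% monthly growth:
print(round(months_until_write_wall(650, 1300, 0.05), 1))  # 14.2
```

Plugging in plausible numbers lands in the same 12-14 month window the team projected, which is why a hard rollout push from sales could move the deadline so dramatically.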

Four Engineers and a Deadline

Seagull meme about migrating

Madeline wasn’t supposed to lead infrastructure migrations. She was a product engineer on the messages team, a team of five who owned the most critical part of Slack. But when she partnered with Maggie from infrastructure to pitch this project, she had something previous migration leads didn’t: a non-negotiable deadline.

She got four engineers full-time: Slack’s largest migration team ever. Two from product who knew every weird edge case, two from infrastructure who understood Vitess. The scope was clear: migrate the data, nothing else. They found columns no one at Slack remembered creating. They kept them all.

Writing to both systems simultaneously revealed 45,000 diffs. Everything about a message lived in a blob called msg, and the existing tools just reported “msg is different.” Useless. They built new logging to extract specific keys, turning “this 10KB blob doesn’t match” into “the edited_timestamp field is formatted differently.” Then they dumped it all into spreadsheets and triaged by hand.
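The key-level diffing they describe can be sketched like this (field names and formats are hypothetical; Slack’s actual tooling isn’t public):

```python
import json

def diff_blob(legacy_raw: str, vitess_raw: str) -> dict:
    """Compare two serialized message blobs key by key.

    Returns {key: (legacy_value, vitess_value)} for each mismatched key,
    instead of a single useless "msg is different" verdict.
    """
    legacy = json.loads(legacy_raw)
    vitess = json.loads(vitess_raw)
    diffs = {}
    for key in legacy.keys() | vitess.keys():
        a, b = legacy.get(key), vitess.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs

# A formatting-only mismatch now surfaces as a single named field:
old = '{"text": "hi", "edited_timestamp": "1583020800.000100"}'
new = '{"text": "hi", "edited_timestamp": "1583020800.0001"}'
print(diff_blob(old, new))
# {'edited_timestamp': ('1583020800.000100', '1583020800.0001')}
```

Once diffs are named fields rather than opaque blobs, they can be grouped into patterns and triaged in a spreadsheet, which is exactly what the team did.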

Each mode transition was a product launch, complete with sticky notes tracking progress. But the go/no-go decision came down to spreadsheets and gut feel. “We needed something that could tell us ‘99% of diffs are resolved, these patterns are safe, you’re clear to proceed,’” Madeline recalls. Instead, she took personal responsibility for each transition.

February 2020: The World Tilts

They were eight months in, ahead of schedule. IBM was in sunset mode—reading from Vitess, still writing to both systems.

Then March hit. Every company went remote overnight. Shard 542 started showing replication lag at 9 AM Eastern every morning, deep red by noon, barely recovering by evening. The infrastructure team watched the graphs climb. Suddenly it wasn’t about hitting the write wall in two months, it was about surviving until 3 PM PT.

The work they’d put in during 2019 gave them an option, though: flip IBM to fully Vitess immediately. It broke protocol. Enterprise customers were supposed to go last. They did it anyway.
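The mode transitions described above follow a standard dual-write migration ladder, and the emergency flip was just a jump to the final rung. A sketch of that routing (mode names and helpers are hypothetical, inferred from the stages in the story):

```python
from enum import Enum

class MigrationMode(Enum):
    # Hypothetical names for the stages described in the post.
    LEGACY_ONLY = "write legacy, read legacy"
    DUAL_WRITE = "write both, read legacy"
    SUNSET = "write both, read Vitess"   # where IBM sat pre-COVID
    VITESS_ONLY = "write Vitess, read Vitess"

WRITES = {
    MigrationMode.LEGACY_ONLY: ["legacy"],
    MigrationMode.DUAL_WRITE: ["legacy", "vitess"],
    MigrationMode.SUNSET: ["legacy", "vitess"],
    MigrationMode.VITESS_ONLY: ["vitess"],
}
READS = {
    MigrationMode.LEGACY_ONLY: ["legacy"],
    MigrationMode.DUAL_WRITE: ["legacy"],
    MigrationMode.SUNSET: ["vitess"],
    MigrationMode.VITESS_ONLY: ["vitess"],
}

def route(mode: MigrationMode, op: str) -> list:
    """Return which backends an operation touches in a given mode."""
    return WRITES[mode] if op == "write" else READS[mode]

# Flipping a team from SUNSET to VITESS_ONLY drops every duplicated
# write to the legacy shard, which is why load fell immediately:
assert route(MigrationMode.SUNSET, "write") == ["legacy", "vitess"]
assert route(MigrationMode.VITESS_ONLY, "write") == ["vitess"]
```

Because reads were already on Vitess in sunset mode, the flip only removed the duplicated legacy writes; nothing user-visible changed, which is why 300,000 users noticed nothing.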

The effect was instant: write load on shard 542 dropped from 650 QPS to 250 QPS. Replication lag vanished. IBM’s 300,000 users didn’t notice a thing.

“If we hadn’t been in position to do this,” Madeline says, “I don’t know what we would have done.” Eight months of preparation had accidentally built them a COVID contingency plan.

What Actually Worked

They finished in April 2020, celebrating over Zoom as they dropped the legacy tables. Ten months exactly, zero customer incidents. When Amazon came calling—threatening to be 3-4x larger than IBM—the messages team just shrugged. “Channel-sharded on Vitess. Won’t be a problem.”

Real deadlines create real teams. The 12-month cliff got them four engineers when previous migrations scraped by with one. Existential threats clarify priorities.

Scope discipline saves migrations. They left old columns. They kept the same primary key. They touched nothing that wasn’t breaking. Every “while we’re in there” is a month added to the timeline.

Make progress observable. Each mode transition was a launch. Sticky notes on a timeline showed exactly where they were. When you can’t see progress on a 10-month project, morale dies at month six.

Product engineers need to own product migrations. Infrastructure knew Vitess. Product knew why that weird JSON field mattered. Together, they made decisions in days that would have taken weeks of back-and-forth.

Find Madeline at m@madelineshortt.com or on LinkedIn.

Watch the Full Episode
