The Storm Chaser Playbook

These are the highlights from an episode of Tern Stories. You can watch the full conversation with Sarah Mann on YouTube, Spotify, Apple, or wherever you get your podcasts.
A guest user saw something they shouldn’t.
Slack takes channel permissions seriously. It’s the core of why users trust it. Imagine tracing that bug: permissions logic scattered across services, a flag called guest_status set inconsistently in half a dozen places, each touched by a different async job. No obvious crash. No clean rollback. Just subtle, persistent drift.
“We had a lot of meetings where we went, ‘It works that way? Why?’” Sarah recalls.
It wasn’t a one-off. Slack’s Storm Chasers found the same pattern again and again: data that had quietly rotted over time, long after the original bug was fixed.
Instead of treating each cleanup as an isolated fix, Sarah’s team built a playbook that helped them restore trust in the system and keep it that way.
1. Write Contracts
Writing data contracts might sound abstract, but it’s deeply tied to what users actually experience. When Sarah’s team says something like “a deactivated user has no valid sessions,” that’s not just an internal rule—it’s a check on what the production system is really doing.
These contracts become a bridge between the code you write, the data it produces, and the behavior users see. If they drift apart, even a little, you get bugs that are almost impossible to spot from code alone.
Some of the earliest contracts they wrote were deceptively simple:
- Guest status is a single, global property per user.
- A deactivated user has zero valid sessions.
- Inviting someone always creates exactly one user row.
Each one captured a user-facing truth and helped surface cases where different parts of the system disagreed.
Contracts like these turn ambiguity into action. They anchor cleanups in what actually matters: production behavior.
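In practice, a contract like the ones above can be written down as a named query that returns violating rows, so "the contract holds" simply means "the query returns nothing." A minimal sketch in Python (the table and column names here are invented for illustration, not Slack's actual schema):

```python
# Each contract is a name plus a SQL query that returns violating rows.
# An empty result set means the contract holds.
# Table and column names are hypothetical.
CONTRACTS = {
    "guest_status_is_global": """
        SELECT user_id
        FROM user_flags
        GROUP BY user_id
        HAVING COUNT(DISTINCT guest_status) > 1
    """,
    "deactivated_users_have_no_sessions": """
        SELECT u.user_id
        FROM users u
        JOIN sessions s ON s.user_id = u.user_id
        WHERE u.deactivated = 1 AND s.valid = 1
    """,
    "invite_creates_one_user_row": """
        SELECT invite_id
        FROM users
        GROUP BY invite_id
        HAVING COUNT(*) > 1
    """,
}
```

Phrasing contracts as "show me the violations" rather than "is everything OK" pays off later: the same query drives the dashboard, the backfill worklist, and the alert.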
2. Make a Dashboard
Once the contracts were written, Sarah’s team asked the obvious next question: are they true right now? The answers came from nightly production snapshots loaded into the data warehouse, where each contract turned into a simple SQL query. One counted users with multiple guest flags. Another flagged deactivated accounts with open sessions.
The output was a dashboard—a fast, visual audit of what was broken and how badly. Violations were piped into Slack via scheduled queries, so new drift showed up as it happened.
The dashboard wasn’t just about monitoring cleanup—it uncovered real bugs no one knew were there. Contracts meant to guide a migration ended up surfacing live issues: permissions gone wrong, sessions that wouldn’t expire, users stuck in impossible states.
“Don’t look at logs on a Friday afternoon,” Sarah jokes, “or you’ll realize you’ve been in an incident since 2022.”
This approach flipped the cleanup on its head. Instead of pushing checks into the app, the team pulled truth out of production. It was cheap, fast to change, and exposed exactly where the system was already out of spec.
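The loop Sarah describes, snapshot then query then report, can be sketched end to end with an in-memory SQLite database standing in for the warehouse. Everything here (schema, data, the contract query) is a toy stand-in for illustration:

```python
import sqlite3

# Stand-in for a nightly warehouse snapshot; schema is illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (user_id INTEGER, deactivated INTEGER);
    CREATE TABLE sessions (user_id INTEGER, valid INTEGER);
    INSERT INTO users VALUES (1, 1), (2, 0);
    INSERT INTO sessions VALUES (1, 1), (2, 1);
""")

# Contract: a deactivated user has zero valid sessions.
violations = db.execute("""
    SELECT u.user_id
    FROM users u JOIN sessions s ON s.user_id = u.user_id
    WHERE u.deactivated = 1 AND s.valid = 1
""").fetchall()

# Each violating row becomes a dashboard entry (or a Slack message).
for (user_id,) in violations:
    print(f"contract violated: deactivated user {user_id} has a valid session")
```

Because the check runs against a snapshot rather than live application code, changing what you measure is a query edit, not a deploy.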
3. Burn It Down
Once the drift was visible, Sarah’s team started burning it down. Carefully. They tackled both data and code: migrating guest flags into a single column, backfilling hundreds of millions of rows, and rewriting helpers used in 300+ call sites. Everything shipped in small, scoped pull requests.
“We’re not going to take down Slack,” Sarah says. “We’ll have SEV-2s and 3s, but we’ll spread them across 40 changes.”
That discipline worked. Every dashboard hit zero.
But zero wasn’t the finish line—it was the guardrail. Once the data was clean, the same dashboard queries were wired to alerts. If anything slipped, the team would know within minutes.
This last step—the ratchet—is what makes the playbook sustainable. Fixes don’t silently rot. Every regression becomes visible. And migrations don’t need flag days to hold: they can land slowly, safely, and still stick.
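The ratchet itself is just the same contract query run on a schedule, with any nonzero count treated as a page rather than a chart. A hedged sketch, where the violation counter and the alert hook are placeholders for a scheduled warehouse query and a Slack webhook:

```python
def ratchet(name, count_violations, alert):
    """Run a contract check; once the count is zero, any regression alerts."""
    n = count_violations()
    if n > 0:
        alert(f"{name}: {n} violation(s), contract regressed")
    return n

# Toy usage with stand-in hooks: a clean run stays silent,
# a run that finds drift fires the alert.
fired = []
ratchet("deactivated_no_sessions", lambda: 0, fired.append)
ratchet("deactivated_no_sessions", lambda: 3, fired.append)
```

The important property is asymmetry: the check never relaxes. Zero is easy to hold once reached, because the first row of drift is loud.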
Why This Playbook Works
This playbook works because it forces decisions. Sarah’s team didn’t start with refactors—they started by writing down what should be true. That clarity turned a messy system into something you could reason about.
From there, everything flowed: dashboards surfaced hidden drift, not just migration scope. Tiny PRs landed without SEV-1s. And once the system was clean, alerts kept it that way.
“You want a place where you can say, this is true now, and will remain true.”
Migrations stall when teams chase symptoms or debate what matters. This approach cuts through the ambiguity. You write contracts, test them against prod, and build a ratchet so fixes stick.
It’s not just safe. It’s fast because it’s focused. Automation here doesn’t replace judgment. It enforces it.
That’s why this story isn’t just about cleaning up a guest flag. It’s about building a system you can trust and proving it, every night.
Connect with Sarah on LinkedIn.