Migrate or die: overhauling the service catalog

These are the highlights from an episode of Tern Stories. You can watch the full conversation on YouTube, Spotify, Apple), or wherever you get your podcasts.
I recently chatted with Matt Ouille, senior software engineer at Slack, about a complete rewrite he did of an aging internal service catalog. The catalog was critical—it served as the single source of truth for systems across the company. After years of patching their Python 2-based monolith, Matt’s team hit a breaking point: migrate or accept endless outages. Here’s what I learned from Matt’s experience—the pitfalls, the surprises, and why it was all worth it.
Background and Motivation
“At some point, we faced a clear choice: migrate or die. So, we migrated.” – Matt
The original Flask-based catalog was drowning in technical debt—enormous JSON blobs in single database columns, outdated libraries, and severe scalability issues. Incremental patches weren’t cutting it, especially during sudden traffic spikes.
Three Paths to Scalability
Matt’s team took three decisive steps to solve their scaling crisis:
1. Embracing Async Python with FastAPI
Switching to FastAPI dramatically improved handling of bursty traffic, allowing the system to process many concurrent requests efficiently. Async endpoints replaced slow, blocking database calls. Matt’s team also leveraged FastAPI’s built-in OpenAPI support to auto-generate client libraries, easing internal adoption.
2. Monolith-Plus: Finding Balance
Rather than breaking into microservices entirely, they selectively extracted resource-intensive tasks—like managing large JSON fields—into specialized microservices. This hybrid architecture preserved the simplicity of a monolith while providing targeted scalability.
3. Moving from Batch to Event-Driven
The team replaced their batch-update model with an event-driven approach, flattening database load and ensuring fresher data. Incremental updates from upstream systems became real-time triggers, smoothing out traffic spikes.
The Mystery and Drama of “Page 11”
Despite these improvements, Matt ran into a peculiar issue: queries consistently stalled on “page 11.” After exhaustive optimization, the problem remained a baffling mystery. Frustrated and almost resigned, Matt erupted during a debugging session with a teammate, who then meticulously tested each database entry individually.
They discovered the true villain—three massive JSON fields hidden in plain sight, quietly sabotaging performance by ballooning memory usage and CPU cycles during routine pagination. Removing these problematic fields instantly stabilized the system. Matt described the moment vividly: “It felt like walking out of a desert and into a lush forest. Everything just suddenly worked.”
The lesson? Even small, short-term shortcuts can explode into major issues at scale.
Winning Over Internal Teams
Matt highlighted that working directly and empathetically with internal teams was key. Many teams genuinely wanted to move to the new system—they were suffering from the same availability and performance issues that Matt’s team was addressing. The challenge wasn’t convincing them to adopt; it was helping them find the time and resources to do it.
Matt’s approach combined clear communication of benefits with practical, hands-on support. He set clear deadlines to keep everyone aligned, while recognizing that simply stating deadlines wouldn’t solve everything. When teams struggled with time or complexity, Matt and his team proactively stepped in, offering direct support like writing migration scripts or even contributing directly to their codebases. This active partnership eased the transition, turning potential resistance into enthusiastic support.
Matt summed it up perfectly: “Their migration was us. If they got stuck, it was our responsibility to help unstick them.” This human-centric approach proved to be a massive superpower, bridging the gap between intention and capability, ultimately ensuring a smooth migration.
Key Lessons Learned
- Don’t Patch Core Issues: Sometimes only a rewrite will do.
- Microservices Aren’t Always the Answer: Selectively scaling specific components can simplify operations.
- Watch Your Data: Hidden bottlenecks often lurk in the most unexpected places.
- Event-Driven Wins: Incremental updates reduce spikes and keep data current.
- Empathy Matters: Supporting internal teams turns resistance into enthusiasm.
The Outcome
Today, Matt’s team enjoys a stable, scalable platform, free from constant firefighting, focusing instead on new features and integrations.
Migrations can transform technical pain into opportunity when tackled with clear strategy and genuine empathy.
Connect with Matt at ooo-yay.com (a phonetic play on “Ouille”), LinkedIn, or Hacker News under the same handle.