How Snap Rewrote Everything in 18 Months While 200M Users Kept Snapping

In late 2017, Ben Hollis was lying in bed with pneumonia, wondering if he should quit.
He’d just survived a death march. After Instagram cloned Stories, CEO Evan Spiegel had demanded a complete app redesign … in one month. The entire team was crammed into an LA aircraft hangar, typing away at long tables like a startup factory floor. Ben flew down from Seattle to visit, caught pneumonia from the crowded conditions, and spent December recovering.
The redesign shipped late, and it didn’t work. Look at Snapchat’s user growth chart: the numbers start going down at that redesign, not when Instagram launched Stories, but when Snap tried to fight back.
“I came back, healed up, and I was like, I don’t know what I’m doing here,” Ben says.
The momentum had stopped. It stung that the project hadn’t succeeded, but more importantly: it had taken a quarter to do what used to take a week. The infrastructure couldn’t support the features Evan wanted to ship.
That’s when Ben and his team decided to do something that “never works”: rebuild everything from scratch.
Instagram Cloned Stories & Snapchat Took Months to Ship a Response
Snapchat’s backend was built on Google App Engine as a monolith: a reasonable choice when you’re a simple photo-sharing app. But as features accrued organically, the system had evolved into something far more complex.
Snapchat’s famous disappearing photos require something unusual: the client and server must cooperate perfectly. “If I send you a photo, and you view it, your device has to say, I viewed this photo, so the system can know to delete it,” Ben explains. Changing behavior in a cooperating client and server was delicate.
As a result, dead features haunted the codebase. Snapcash (Snap’s brief experiment with payments) had been pulled from the app years ago, but those payment messages still existed, saved in old chats. Any rewrite had to support every message type that had ever existed. “How do you even get one of those?” Ben recalls thinking. “You have to go find one and see what the format is.”
The infrastructure costs were astronomical. Ben can’t quote real numbers, but Snap’s S-1 filing revealed hundreds of millions in annual spending to Google and Amazon. The architecture that had evolved organically wasn’t just hard to change. It was expensive to run.
“We knew at the time it was insane,” Ben says about their decision to rewrite everything. “But there was this mood of like, well, why don’t we just try it? What do we have to lose?”
Migrating 200 Million Users One Conversation at a Time
Ben’s team made a counterintuitive choice: migrate individual conversations, not users or systems.
Your chat with mom could stay on the old Google App Engine monolith while your group chat moved to the new AWS DynamoDB backend. The client checked each conversation and dynamically chose which API to use. This meant building a backwards compatibility layer that could translate between two completely different architectures—conversation by conversation.
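A minimal sketch of that routing decision, assuming a simple lookup of already-migrated conversations (the names and callbacks here are illustrative, not Snap’s actual code):
```python
# Minimal sketch of per-conversation routing; names and structure are
# illustrative, not Snap's actual code.
from typing import Callable, List

# Conversation IDs that have already been cut over to the new backend.
MIGRATED = {"group-chat-42"}

def fetch_messages(
    conversation_id: str,
    fetch_legacy: Callable[[str], List[dict]],  # old App Engine monolith API
    fetch_new: Callable[[str], List[dict]],     # new DynamoDB-backed API
) -> List[dict]:
    """Route each conversation to whichever backend currently owns it."""
    if conversation_id in MIGRATED:
        return fetch_new(conversation_id)   # migrated: talk to the new service
    return fetch_legacy(conversation_id)    # not yet: stay on the monolith
```
The important property is that the unit of routing, and later of rollback, is a single conversation rather than a user or a whole service.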
The key was verification. They built a dual read-write system that cloned conversations, split the writes, and diffed every response. Text messages: 100% match. Snaps: occasional differences. Investigate.
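The dual-read path might look something like this hedged sketch (function names are assumptions): serve the old system’s answer, shadow-read the new one, and log any mismatch for a human to chase down.
```python
# Hedged sketch of dual reads with diffing; the real system split writes as
# well, but the read-side comparison is the core idea.
import json
import logging
from typing import Callable, List

log = logging.getLogger("migration-diff")

def dual_read(
    conversation_id: str,
    read_old: Callable[[str], List[dict]],   # legacy backend, source of truth
    read_new: Callable[[str], List[dict]],   # new backend, shadow copy
) -> List[dict]:
    old = read_old(conversation_id)
    try:
        new = read_new(conversation_id)      # shadow read, never user-visible
        if json.dumps(old, sort_keys=True) != json.dumps(new, sort_keys=True):
            # Text messages should always match; old snap formats and
            # forgotten message types show up here for investigation.
            log.warning("mismatch in conversation %s", conversation_id)
    except Exception:
        log.exception("shadow read failed for %s", conversation_id)
    return old                               # users only ever see legacy data
```
Because users only ever see the old system’s response, a mismatch is a log line to investigate rather than a bug anyone notices.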
“That diffing system was the key to making this seamless. If we didn’t have that, it would have been, I don’t know, YOLO works on my machine.”
The mismatches revealed ghosts. Snapcash payment receipts from a feature killed years earlier. Message types from 2010 with properties nobody remembered. The compatibility layer had to handle them all—bug-for-bug compatible with code nobody understood.
When something went wrong, they could roll back a single conversation. No outages. No visible bugs. 200 million users kept snapping while Snapchat rebuilt everything underneath them.
Single-Packet Messages and Rural Cell Towers
The rewrite unlocked something they hadn’t planned for.
Ben’s team had been obsessed with message size from the start. Every byte directly translated to bandwidth bills and DynamoDB storage costs. So they redesigned everything: new APIs, new data structures, JSON to gRPC, even replacing the networking layer with Chrome’s Cronet.
The result: most messages now fit in a single network packet.
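To make that concrete, here’s a toy comparison, not Snap’s wire format: the same chat message serialized as JSON versus a compact binary layout, measured against a rough single-packet payload budget (an assumed ~1,400 bytes once IP, TCP, and TLS overhead come out of a 1,500-byte MTU).
```python
# Toy size comparison, not Snap's actual wire format.
import json
import struct

# Rough payload budget for one packet: an assumed ~1,400 bytes after IP,
# TCP, and TLS overhead come out of a 1,500-byte Ethernet MTU.
SINGLE_PACKET_BUDGET = 1400

message = {"conv": 42, "sender": 7, "ts": 1710000000, "body": "you up?"}

json_bytes = json.dumps(message).encode("utf-8")

# Crude stand-in for a protobuf-style encoding: fixed-width ints plus a
# length-prefixed body, with no field names on the wire.
body = message["body"].encode("utf-8")
binary_bytes = struct.pack(
    ">IIQH", message["conv"], message["sender"], message["ts"], len(body)
) + body

print(len(json_bytes), len(binary_bytes))          # 62 vs. 25 bytes here
print(len(binary_bytes) <= SINGLE_PACKET_BUDGET)   # comfortably one packet
```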
“If you go out into the boonies where the cell signal is pretty bad, Snapchat will work. Facebook Messenger, not gonna happen. iMessage, not as good.”
This wasn’t a feature they’d designed. It emerged from touching every part of the system at once. Incremental improvements to the old architecture could never have achieved this, because you’d always be fighting the accumulated overhead of a decade of features. But when you rebuild everything with a unified obsession with size, you get emergent properties: Snapchat working in rural dead zones where nothing else does.
That’s the difference between incremental migration and total reconstruction. You don’t just get a cleaner codebase. You get capabilities you never knew were possible.
Building for Bug-for-Bug Compatibility and AI
“The pain of creating that backwards compatibility layer could have been vastly accelerated,” Ben reflects.
The process was brutal: read thousands of lines of undocumented code, guess the behavior, deploy, wait for metrics, examine mismatches, fix, repeat. For every message type from 2010, every forgotten Snapcash receipt, every edge case nobody remembered.
“I can easily imagine building an LLM agent loop for that. The LLM every day looks at the logs, figures out what the bugs are, changes the code, tries it again. We could have all been on the beach sipping Mai Tais.”
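That loop is easy to imagine as a skeleton. In this sketch every function is a placeholder you would have to supply yourself; none of these names are an existing API:
```python
# Skeleton of the nightly agent loop Ben imagines; every callable is a
# placeholder you would supply, not an existing API.
from typing import Callable, List

def nightly_compat_loop(
    fetch_mismatch_logs: Callable[[], List[str]],   # yesterday's diff output
    propose_patch: Callable[[List[str]], str],      # e.g. an LLM call
    apply_and_verify: Callable[[str], bool],        # tests + shadow traffic
    rollback: Callable[[], None],
) -> None:
    """Read the diffs, let the model guess a fix, keep it only if the
    mismatches actually shrink."""
    mismatches = fetch_mismatch_logs()
    if not mismatches:
        return                       # compatibility layer is clean today
    patch = propose_patch(mismatches)
    if not apply_and_verify(patch):
        rollback()                   # the diffing system is the safety net
```
The diffing system from the migration is what would make such a loop safe to run: a bad patch shows up as fresh mismatches and gets rolled back.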
The pattern here is universal. Every migration has this same structure: exploration, framework building, then execution. The Snapchat story isn’t unique in this. It just makes the pattern visible at massive scale.
The exploration phase, figuring out what actually needs to be migrated, is irreducibly human. Ben’s team had to decide that conversations, not users, were the right migration unit. Decide where to split the request stream and diff. Pick what the new architecture should do.
But once you’ve built the framework, the execution becomes mechanical. Read logs, identify mismatches, fix the code, verify. This is where AI accelerates everything.
“If you can pair up an agent that can effect change with a system that can accept change safely… you can really let loose,” Ben observes.
At Tern, we see this pattern everywhere: the hard part isn’t writing migration code, it’s building the understanding and framework that makes migration safe. Once you have that, whether it’s LLMs or “an army of interns” handling execution, the outcome is the same.
The Snapchat rewrite succeeded because they turned an impossible moonshot into thousands of mechanical, verifiable steps. That transformation is what makes any migration possible. And increasingly, it’s what makes migrations automatable.
You can find Ben Hollis and his team building the future of changeable infrastructure at stately.cloud.
These are the highlights from an episode of Tern Stories. You can watch the full conversation with Ben on YouTube, Spotify, Apple, or wherever you get your podcasts.