Bad → OK → Awesome: A Field Guide to High‑Stakes Migrations

Twitter had six weeks before running out of tweet IDs.
In late 2009, tweets were stored in a MySQL database with a signed 32-bit ID. The schema change itself was easy: make the column unsigned, or widen it to 64 bits. Six hours of downtime in exchange for a site that kept running.
What wasn’t easy was everything outside the database: thousands of third-party clients had hard assumptions about ID size and type, caches and ORM layers stored them as integers, and whole companies mirrored those IDs in their own tables. A silent change wouldn’t just break pages. It would corrupt other people’s data.
These are the highlights from an episode of Tern Stories. You can watch the full conversation with Ryan King on YouTube, Spotify, Apple, or wherever you get your podcasts.
Twitpocalypse 1: A public migration
“The database didn’t care. Everyone else did.”
The first decision wasn’t about code. It was about coordination.
The hard part was convincing hundreds of developers that this was actually happening. Twitter’s tiny developer relations team reached out to API consumers individually, published documentation, sent notifications, and gave everyone a clear timeline.
In less than four weeks, it became a movement. Someone even built a countdown website: the Twitpocalypse clock, ticking toward zero. With the whole ecosystem watching, developers who’d ignored the warnings suddenly discovered JSON parsers that couldn’t handle the new size and hardcoded assumptions buried in their code. The public countdown forced action where documentation hadn’t.
The technical execution was straightforward: use MySQL replication tricks to upgrade a replica first, then promote it to master. Total downtime: under an hour instead of six.
It took 4.5 weeks of coordination to execute what was technically a one-hour migration. By the time they threw the switch, with just over a week to spare, every major consumer had updated their code. The migration was anticlimactic. That was the point.
Twitpocalypse 2: 64 bits
It took three years to burn through 2 billion IDs. The next 2 billion would take four or five months.
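Here’s the back-of-the-envelope math behind those countdowns, as a quick aside (the limits below are just integer-width ceilings, not figures from the conversation):

```scala
object IdHeadroom extends App {
  // How much headroom each ID width buys. The "2 billion" figures in the
  // story fall out of the signed vs. unsigned 32-bit ceilings.
  println(Int.MaxValue)     // 2147483647: the original signed 32-bit ceiling (~2.1 billion)
  println((1L << 32) - 1)   // 4294967295: unsigned 32 bits buys roughly 2 billion more
  println(Long.MaxValue)    // 9223372036854775807: 64 bits ends the countdown for good
}
```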
This time they needed 64-bit integers, and nothing was ready. Java: no unsigned 64-bit type. Ruby’s MySQL libraries: broken support. Multiple databases couldn’t even store the numbers they needed. Each compatibility issue meant another team, another language, another migration plan.
“What could have been just an ALTER TABLE statement took three and a half, four months.”
The command wasn’t more complex, but the ecosystem was. More internal teams were secretly counting IDs for analytics. More third parties had built dependencies they didn’t know about.
But they had one advantage: organizational scar tissue. Everyone remembered Twitpocalypse 1. This time when Twitter said “urgent,” developers believed them. They started earlier, communicated louder, and assumed everything would take twice as long as estimated.
Sharding, but only for tweets
While they were solving ID exhaustion, another countdown started: database storage. Six weeks until the tweets database hit capacity. The biggest drives they could buy (maybe 2TB) would be full.
They had ideas. They could migrate to a distributed database, though there weren’t many good options in 2010. They could partition tweets by user like they’d done with the social graph. But every proper solution would take months they didn’t have. If usage spiked, six weeks could become four.
So they gave themselves three weeks.
With that constraint, only one approach made sense: partition by time. All new tweets write to the most recent shard. Reads walk backward until they find what they need. Since Twitter was still chronological (no algorithmic timeline yet!), most reads would hit the newest partition anyway. The time locality of tweets became the architecture.
The implementation was deliberately crude. YAML files in their Rails app told each server which database to talk to. To roll out a new partition, they’d deploy it with a broken table name so writes would fail and fall back to the previous shard. Then they’d run ALTER TABLE on the empty database to fix the name. Instantly, writes flowed to the new shard.
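The routing logic really is that small. Here’s a minimal sketch of the idea; the Shard and Tweet types and the lookup function are made up for illustration, and the real thing was YAML config in a Rails app rather than code like this:

```scala
// Minimal sketch of time-based shard routing. The types and the lookup
// function are illustrative stand-ins, not Twitter's actual code.
case class Shard(name: String, createdAtMs: Long)
case class Tweet(id: Long, text: String)

class TimePartitionedStore(shards: Seq[Shard], lookup: (Shard, Long) => Option[Tweet]) {
  // Newest shard first. All writes go to the head; older shards are effectively read-only.
  private val newestFirst = shards.sortBy(-_.createdAtMs)

  def writeShard: Shard = newestFirst.head

  // Reads walk backward in time until some shard has the tweet. With a
  // chronological timeline, most reads stop at the first (newest) shard.
  def read(tweetId: Long): Option[Tweet] =
    newestFirst.view.flatMap(shard => lookup(shard, tweetId)).headOption
}
```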
Three weeks from problem to production. No API changes. No downtime. Just a DBA provisioning new hardware when each partition approached capacity, and the system keeping ahead of the growth curve.
Eight Months to Nowhere
Temporal sharding bought them time, but at a cost. Every few weeks, another partition. Every partition, more hardware to rack, more YAML files to deploy, more replicas to manage. The DBA was basically living at the datacenter, constantly shuffling hardware to stay ahead of the growth curve.
They needed a real distributed database that could scale without manual intervention. But first they had to solve a prerequisite: a distributed database can’t lean on MySQL’s auto-increment for IDs. You need to be able to generate an ID on any machine, before the write happens, without collisions.
Snowflake took one hour to write: timestamp for rough ordering, machine ID for uniqueness, counter for throughput. So simple another engineer coded it before Ryan returned to his desk. But Twitter had been burned by “simple” before. Ryan ran it at maximum throughput for a month straight, generating trillions of IDs. Zero collisions.
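The shape of it, as a hedged sketch: the 41/10/12-bit split and the epoch constant below are the commonly cited Snowflake layout, used here as assumptions rather than a claim about Twitter’s exact production values.

```scala
// Snowflake-style ID generation: time-ordered, unique per machine, fast.
// Assumed layout: 41 bits of milliseconds since a custom epoch, 10 bits of
// machine ID, 12 bits of per-millisecond counter.
class SnowflakeSketch(machineId: Long, epoch: Long = 1288834974657L) {
  require(machineId >= 0 && machineId < (1 << 10), "machine ID must fit in 10 bits")

  private var lastMs = -1L
  private var seq    = 0L

  def nextId(): Long = synchronized {
    var now = System.currentTimeMillis()
    if (now == lastMs) {
      seq = (seq + 1) & 0xFFF          // 12-bit counter within a single millisecond
      if (seq == 0) {                  // counter exhausted: spin until the next millisecond
        while (now <= lastMs) now = System.currentTimeMillis()
      }
    } else {
      seq = 0
    }
    lastMs = now
    ((now - epoch) << 22) | (machineId << 12) | seq  // time | machine | counter
  }
}
```

Because the timestamp sits in the high bits, the IDs still sort roughly by creation time, which preserved the rough ordering the timeline depended on.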
The week before launch: “Hey, has anyone tested this in JavaScript?”
JavaScript stores every number as a 64-bit float, so integers above 2^53 silently lose precision. Three weeks of emergency coordination to ship IDs as strings everywhere. But it worked.
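The failure is easy to reproduce even outside JavaScript, because a JS Number is just an IEEE 754 double. A small sketch of the problem and the string workaround (this is why Twitter’s public API grew a string `id_str` next to the numeric `id`):

```scala
// Doubles only represent integers exactly up to 2^53. Scala's Double has the
// same representation as a JS Number, so it shows the same silent corruption
// a 64-bit tweet ID hit in browsers.
object FloatIdDemo extends App {
  val id = (1L << 53) + 1             // 9007199254740993
  println(id.toDouble.toLong)         // 9007199254740992: silently off by one
  println(id.toDouble.toLong == id)   // false

  // The fix: also ship the ID as a string so clients never round-trip it
  // through a float.
  println(s"""{"id": $id, "id_str": "$id"}""")
}
```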
“Why would you do that to a programming language? Aren’t integers a good idea??”
Now they had everything: distributed IDs, breathing room, a bigger team, real budget. Cassandra was the obvious choice: Facebook had just open-sourced it, which meant Twitter would be an early adopter. They built clients in Ruby and Scala. They had feature flags, dark mode, incremental rollouts: infrastructure they’d only dreamed of during the crisis migrations.
Turn up traffic. Find edge case. Fix it. Repeat. Every week felt like progress.
Eight months later, still “almost there” … Ryan’s boss killed it. Ryan took a two-week vacation, starting that day.
The Framework That Stuck
During that break, Ryan developed the model that would guide Twitter engineering for years: Bad → OK → Awesome.
“When you’re drowning, you swim for shore, not the horizon.”
- Bad → OK gets you breathing room: temporal sharding, dual-write periods, pointer flips. Whatever ships in three weeks.
- OK → Awesome is where you build for the future: proper abstractions, cleaner boundaries, systems that scale without heroes.
The six-week constraint had been a gift. It forced them to map the actual problem: not “we need Cassandra” but “we need IDs that work everywhere, including JavaScript.” Not “we need perfect distribution” but “we need to stop manually racking servers every month.” The Cassandra project had no such constraint.
The countdown sites, the panicked API consumers, the JavaScript float problem nobody thought to check: these weren’t obstacles. They were the system teaching you about itself, one near-disaster at a time.