When the Wrong Tool is the Right Choice: PagerDuty's Cassandra Queue

TR Jordan

These are the highlights from an episode of Tern Stories. You can watch the full conversation with Arup Chakrabarti on YouTube, Spotify, Apple, or wherever you get your podcasts.

In 2011, PagerDuty was using MySQL as a queue. If you just winced, you understand the problem.

They were only handling tens of transactions per second, but disk writes were already struggling under synchronous replication. This wasn’t sustainable for their notification pipeline—the system that delivered every alert to their customers.

The stakes were clear. PagerDuty had made a specific promise to customers: every event it accepted was replicated across multiple AWS regions before the API returned an HTTP 200. This wasn’t negotiable.

“The difference between a 200 and 500 is that if you served up a 200 and it was a 500, you lied,” Arup explains. For a company built on reliability, that distinction mattered.
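To make that promise concrete, here is a minimal sketch of an acknowledge-after-replication handler. It is not PagerDuty's code: the function name, the per-region writer callables, and the `required_acks` threshold are all assumptions made for illustration. The only idea taken from the story is that a 200 is returned after enough regions have confirmed the write, and a 500 otherwise.

```python
from typing import Callable, Dict

# Hypothetical type: a writer returns True once its region has durably stored the event.
RegionWriter = Callable[[dict], bool]


def accept_event(event: dict, writers: Dict[str, RegionWriter], required_acks: int) -> int:
    """Return an HTTP status code: 200 only if enough regions confirmed the write."""
    acks = 0
    for region, write in writers.items():
        try:
            if write(event):
                acks += 1
        except Exception:
            # A failed or timed-out region simply does not count as an ack.
            continue
    # Assumption: "replicated across multiple regions" is modeled as a minimum ack count.
    return 200 if acks >= required_acks else 500
```

With three writers and `required_acks=2`, one unreachable region still yields a 200, while two unreachable regions yield a 500. The point of the sketch is that the handler never reports success on the strength of a single local write.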

So they needed a new solution. What they chose would make even less sense than MySQL—at least on paper.

The Decision That Shouldn’t Have Worked

When Arup and his team chose Cassandra as their solution, the maintainers’ reaction was immediate: Don’t do this.

The warnings were explicit:

  • Do not build a queue on top of Cassandra
  • Do not run Cassandra across multiple regions
  • Do not run Cassandra in high-latency network environments

They did all three.

“Everything you look at for any NoSQL database will say don’t do this,” Arup recalls. “It says don’t do this in flashing red.”

But here’s what the warnings didn’t account for: context matters more than correctness. Kafka existed but couldn’t provide the consistency guarantees they needed. They knew people on the Cassandra team. They had experience with it from Netflix.

“For startups, you don’t have time to figure out right. You only have time to figure out good enough,” Arup explains. And good enough meant choosing a tool they could understand and control, even if it wasn’t designed for their use case.

When they told the Cassandra maintainers they wanted to set timeouts to several seconds instead of tens of milliseconds, the response was predictable: “Wait, you want to use this thing? Three orders of magnitude? What?”

But the maintainers did something crucial—they listened. Instead of just saying “you’re holding it wrong,” they helped PagerDuty understand which config variables to adjust and in what order. Sometimes the best technical decision is finding experts who will help you do the “wrong” thing safely.

Making the Wrong Thing Work

The cutover from MySQL to Cassandra wasn’t elegant. They took actual downtime—a terrifying 20-30 seconds where nothing was written to the database.

This violated another best practice: zero-downtime deployments. But Arup’s team had done the analysis. “We realized it was actually safer to just take ten minutes,” he recalls. They’d practiced it dozens of times across staging and load test environments. The founders scrutinized every detail.

“You don’t get lucky with these kinds of things,” Arup notes. When you’re doing something unconventional, over-preparation is your insurance policy.

Once running, the real work began. Cassandra pre-1.0 was, to put it charitably, dynamic. A “minor” upgrade from 0.7 to 0.8 required rewriting the entire configuration file. “This isn’t backwards compatible!” Arup remembers thinking. “This is not complying with semantic versioning!”

The operational reality was even messier. They couldn’t use their configuration management tools because certain variables had to be updated in specific sequences across the cluster. Instead, they SSHed into machines and edited configs by hand.

“We hated it. We were so annoyed that we had to do it, but we couldn’t figure out a better way.”

Yet they persisted. Why? Because when you choose the “wrong” tool for the right reasons, you accept the operational tax as part of the deal. They knew what they were signing up for. Every painful SSH session, every manual config edit—these weren’t surprises, they were tradeoffs they’d consciously made.

The Hidden Costs and Unexpected Benefits

Running Cassandra as a queue across regions created what Arup calls “legendary software”—not legacy, but legendary. “Give it the respect it deserves,” someone once told him at a panel. “Recognize that probably a lot of your revenue is coming from that legendary software.”

It worked. When AWS regions failed (remember, this was the early 2010s, when regional outages were common), PagerDuty stayed up. Their architecture could lose an entire region and keep humming. Sometimes the “wrong” architecture gives you the right resilience.

But the psychological tax was real. Every engineer who touched the Cassandra queue knew it was brutal. Upgrades felt like “magic incantations.” As the team grew from 20 to hundreds of engineers, that fear became a bottleneck.

“It just wreaks havoc on productivity, wreaks havoc on innovation,” Arup admits. The true cost of technical debt isn’t in the code—it’s in the courage it takes to change it.

Knowing When Wrong Becomes Wrong Enough

By 2018, the equation had changed. Kafka had matured to provide the consistency guarantees PagerDuty needed. The company had grown to nearly 10,000 customers. They had the resources and expertise to do things “right.”

The migration was careful and deliberate—running both systems in parallel, slowly bleeding traffic over. With scale came new options: they could partition their customer base, testing on free-tier users first. Scale doesn’t just bring problems; it brings solutions.
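A gradual cutover like this is often driven by a small routing function. The sketch below is an assumption about how such a gate could look, not a description of PagerDuty's migration tooling: the function name, the tier labels, and the percentage parameter are all hypothetical. It hashes the customer ID into a stable bucket so that the same customer always lands on the same side for a given rollout percentage.

```python
import hashlib
from typing import Set


def use_new_queue(customer_id: str, tier: str, rollout_percent: int, enabled_tiers: Set[str]) -> bool:
    """Decide whether this customer's traffic goes to the new queue."""
    # Tiers not yet enabled (for example, paying customers early on) stay on the old system.
    if tier not in enabled_tiers:
        return False
    # Stable bucket in [0, 100): the same customer ID always maps to the same bucket.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Starting with `enabled_tiers={"free"}` and a low `rollout_percent`, then raising the percentage and adding tiers, gives the "slowly bleeding traffic over" shape described above. Because the bucket is derived from the ID rather than chosen at random per request, a customer does not flip between the two systems from one event to the next.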

When they finally deprecated the last Cassandra queue, “a wave of relief came over the entire engineering organization.”

Looking back, Arup is adamant: “I will die on this hill. That was absolutely the right decision for PagerDuty at the time.”

He’s right. They chose a path where they could understand the risks, control the implementation, and meet their consistency requirements. They traded elegance for predictability, best practices for business needs.

Most importantly, they were honest about it. They knew it would be painful. They knew it wasn’t “correct.” But they also knew that in the space between “perfect” and “possible” lies a sweet spot called “pragmatic.”

Sometimes the wrong tool is the right choice. The wisdom lies not in the choice itself, but in understanding why you’re making it, what it will cost you, and—crucially—when it’s time to move on.

Just maybe start planning that Kafka migration sooner rather than later.

You can find Arup at arupchak.com or on LinkedIn.
