
Outscaling Elasticsearch at Datadog: Upgrading the Core Datastore

TR Jordan

These are the highlights from an episode of Tern Stories. You can watch the full conversation with Ian Nowland on YouTube, Spotify, Apple, or wherever you get your podcasts.

By the late 2010s, Datadog’s observability platform was expanding rapidly. New products were coming online—logs, traces, synthetics—each ingesting large volumes of telemetry data. But most were building their own pipelines, using different infrastructure, and solving the same storage problems independently.

“We had to build a better platform. But first, we had to outscale Elasticsearch—without breaking anything it got wrong.”

What Datadog needed was a shared foundation: scalable, multi-tenant, and purpose-built for observability workloads. That meant rethinking ingestion, query patterns, and storage behavior across the board.

The result was Husky: a column-oriented data store built directly on S3, optimized for sparse queries, long retention, and high-volume ingest. It was designed to support every existing product—and to give new ones a fast, reliable path to production without reinventing infrastructure.
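To make that concrete, here is a minimal sketch of the columnar-on-object-storage idea, not Husky itself: the episode doesn't cover Husky's actual file format, schema handling, or S3 layout, so the ObjectStore interface, the org/time key layout, and the newline-joined column encoding below are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// ObjectStore stands in for an S3-like blob store (hypothetical interface).
type ObjectStore interface {
	Put(key string, body []byte) error
}

// memStore keeps objects in memory so the sketch runs without S3.
type memStore struct{ objects map[string][]byte }

func (m *memStore) Put(key string, body []byte) error {
	m.objects[key] = body
	return nil
}

// Event is one row of telemetry; Fields is sparse, since most attributes
// appear on only a fraction of events.
type Event struct {
	OrgID     string
	Timestamp time.Time
	Fields    map[string]string
}

// writeSegment pivots a batch of rows into per-column objects under a
// tenant + time prefix, so a query that touches two attributes only has to
// read two small objects instead of every row in the window.
func writeSegment(store ObjectStore, events []Event) error {
	if len(events) == 0 {
		return nil
	}
	// Assumed partition layout: customer org plus an hourly time bucket.
	prefix := fmt.Sprintf("org=%s/dt=%s",
		events[0].OrgID, events[0].Timestamp.UTC().Format("2006-01-02-15"))

	columns := map[string][]string{}
	for _, e := range events {
		for field, value := range e.Fields {
			columns[field] = append(columns[field], value)
		}
	}
	for field, values := range columns {
		key := fmt.Sprintf("%s/col=%s.dat", prefix, field)
		if err := store.Put(key, []byte(strings.Join(values, "\n"))); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	store := &memStore{objects: map[string][]byte{}}
	now := time.Now()
	_ = writeSegment(store, []Event{
		{OrgID: "1234", Timestamp: now, Fields: map[string]string{"service": "web", "status": "200"}},
		{OrgID: "1234", Timestamp: now, Fields: map[string]string{"service": "api"}},
	})
	for key := range store.objects {
		fmt.Println(key) // e.g. org=1234/dt=2024-01-01-12/col=service.dat
	}
}
```

The point of the pivot is that a sparse query touching a handful of attributes reads only those column objects, which is the access pattern the post contrasts with Elasticsearch's index-everything-on-write model.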

In this episode of Tern Stories, Ian Nowland—then SVP of Engineering at Datadog—explains how Husky came together, and how Datadog migrated core production systems without dropping customer traffic or trust.

Why Elasticsearch didn't scale

Elasticsearch helped Datadog move fast in the early days. It was flexible, well-documented, and easy to prototype against. But as usage scaled, its limitations became harder to ignore.

It indexed everything on write, driving up ingest costs and making long-term retention expensive. Its clustering model wasn’t built for constant horizontal scaling—adding nodes under load meant rebalancing shards, often leading to instability. It also wasn’t optimized for sparse queries or time-partitioned data, which are fundamental to observability workloads.

Logs and traces pushed these problems to the limit. Logs are high-volume, semi-structured, and rarely read after ingest. Traces require low-latency exploratory queries. Neither workload mapped cleanly to a search engine built for full-text indexing.

Still, customers had learned to trust Elasticsearch’s behavior—including its bugs. Alerting logic often depended on subtle quirks in how queries resolved. So any replacement had to replicate those quirks to avoid breaking production.

“The Elasticsearch semantics became the correct semantics.”

Husky couldn’t just be more efficient. It had to be compatible with the mistakes that came before it.

Why APM was the right first step

Despite logs being larger and more operationally painful, APM was the first product to migrate—and technically, it made the most sense.

Its data model already assumed sampling and partial loss, which gave the team more tolerance for edge cases during the transition. Unlike logs, APM wasn’t deeply tied to compliance or security workflows. It didn’t require strict deduplication or exactly-once delivery.

It also benefited from existing structure: APM data was already bucketed by time and customer, making it easier to partition and route through the new system. Queries were typically exploratory, and more forgiving of small discrepancies—ideal for validating correctness in parallel with the legacy backend.

You can’t afford to double-write a log line and have it accidentally trigger an alert. With APM, you had more room to maneuver.

The business case didn’t hurt—APM was growing fast. But the real reason it went first was simple: it was the safest and clearest way to prove Husky in production.

How Kafka made it possible

“If we can make Kafka really easy for product teams, they’ll stop caring about the backend. That gives us control.”

Before building Husky’s storage layer, the team focused on standardizing ingestion. Each product had its own Kafka and Elasticsearch pipeline, often with custom retry semantics and queueing behavior. Rather than force change from below, the platform team made it easy for product teams to hand off ingestion—abstracting away Kafka management and creating a clean boundary between ingest and storage.
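As a rough illustration of that boundary, the sketch below shows one possible shape for the handoff: product code calls a single narrow Submit method, and everything Kafka-specific lives behind an interface the platform team owns. The IngestClient name, the intake.<product> topic convention, and the retry policy are assumptions made up for this example, not Datadog's internal API.

```go
package main

import (
	"fmt"
	"time"
)

// producer abstracts the real Kafka client; the platform team owns its
// configuration (topics, partitioning, acks), so product teams never see it.
type producer interface {
	Send(topic string, key, value []byte) error
}

// IngestClient is the narrow surface a product team codes against.
type IngestClient struct {
	producer producer
	product  string // e.g. "apm" or "logs"; selects a platform-owned topic
}

// Submit hands an event to the platform. Routing, batching, and retry policy
// are the platform's problem, not the product team's.
func (c *IngestClient) Submit(orgID string, payload []byte) error {
	topic := fmt.Sprintf("intake.%s", c.product) // assumed naming convention
	var err error
	for attempt := 0; attempt < 3; attempt++ { // simple bounded retry
		if err = c.producer.Send(topic, []byte(orgID), payload); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond)
	}
	return fmt.Errorf("ingest handoff failed after retries: %w", err)
}

// stdoutProducer is a stand-in so the sketch runs without a Kafka cluster.
type stdoutProducer struct{}

func (stdoutProducer) Send(topic string, key, value []byte) error {
	fmt.Printf("topic=%s key=%s bytes=%d\n", topic, key, len(value))
	return nil
}

func main() {
	client := &IngestClient{producer: stdoutProducer{}, product: "apm"}
	_ = client.Submit("1234", []byte(`{"trace_id":"abc","duration_ms":42}`))
}
```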

This enabled shadow ingestion: data was written to both Elasticsearch and Husky, and a comparison layer validated correctness and performance live in production. It wasn’t enough to be faster—Husky had to match existing results exactly, including Elasticsearch’s quirks. That setup allowed for safe, incremental rollout: teams could shift traffic gradually, test queries against real workloads, and roll back when needed.
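The shadow pattern itself is easy to sketch under assumed interfaces (neither backend's real client is shown here): every write goes to both stores, reads are still answered by Elasticsearch, and the same query is replayed against Husky so any divergence can be recorded and investigated before traffic moves.

```go
package main

import (
	"fmt"
	"log"
	"reflect"
)

// Backend stands in for either Elasticsearch or Husky behind a shared query surface.
type Backend interface {
	Write(event map[string]any) error
	Query(q string) ([]map[string]any, error)
}

// ShadowRouter dual-writes every event and compares query results, while the
// legacy backend remains the source of truth for anything customer-facing.
type ShadowRouter struct {
	legacy Backend // Elasticsearch: still serves all reads
	shadow Backend // Husky: receives the same traffic for validation
}

// Write performs the dual write; shadow failures are logged, never surfaced.
func (r *ShadowRouter) Write(event map[string]any) error {
	if err := r.shadow.Write(event); err != nil {
		log.Printf("shadow write failed (ignored): %v", err)
	}
	return r.legacy.Write(event)
}

// Query answers from the legacy backend, replays the same query against the
// shadow backend, and records any divergence, quirks included.
func (r *ShadowRouter) Query(q string) ([]map[string]any, error) {
	want, err := r.legacy.Query(q)
	if err != nil {
		return nil, err
	}
	got, shadowErr := r.shadow.Query(q)
	if shadowErr != nil || !reflect.DeepEqual(want, got) {
		// In production this would feed a diff report, not just a log line.
		log.Printf("shadow mismatch for %q: err=%v", q, shadowErr)
	}
	return want, nil
}

// memBackend is an in-memory stub so the sketch runs without either datastore.
type memBackend struct{ events []map[string]any }

func (m *memBackend) Write(e map[string]any) error { m.events = append(m.events, e); return nil }

func (m *memBackend) Query(q string) ([]map[string]any, error) { return m.events, nil }

func main() {
	router := &ShadowRouter{legacy: &memBackend{}, shadow: &memBackend{}}
	_ = router.Write(map[string]any{"service": "web", "duration_ms": 42})
	results, _ := router.Query("service:web")
	fmt.Println("served from legacy:", len(results), "result(s)")
}
```

Because the legacy result is always the one returned, a mismatch costs a log line and an investigation rather than a customer-facing incident, which is what made gradual rollout and rollback cheap.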

Once that pattern was established, migrations moved faster. New products were onboarded early. And because the platform now owned ingest, it took on compliance burdens like HIPAA and PCI—giving product teams one less thing to manage.

Wrapping Up: The Migration Behind the Migration

Every big migration starts with a technical constraint—but whether it succeeds depends on what surrounds it.

  • What guarantees did the old system make, intentionally or not?
  • Where can you hide change until it’s proven?
  • Which products are safe enough to go first?

Datadog’s migration to Husky wasn’t fast or simple. But it worked—because the team was deliberate about where to start, what to preserve, and how to earn trust.

As Ian put it, they had to outscale Elasticsearch without breaking anything it got wrong. In doing so, they built a foundation that could support everything that came next.

You can find Ian at Junction Labs at inowland@junctionlabs.io or on LinkedIn.

Watch the Full Episode
