Elasticsearch

A production tour of Elasticsearch: the inverted index, the refresh/flush/merge cycle, scatter-gather queries, mapping explosion, heap death spirals, and why it is never your source of truth.

21 min readupdated 2026-06-28

On this page

Elasticsearch gets adopted as “a fast search box you point at your data.” That framing is fine right up until the moment someone makes it the primary store for that data, a node spends ninety seconds in a garbage-collection pause, the master drops it from the cluster, its shards relocate under load, and a search latency blip becomes a cluster-wide outage. Elasticsearch is a distributed coordination layer wrapped around Apache Lucene, optimized for one job: answering full-text and analytical queries over enormous document sets, fast.

Almost every operational surprise traces back to that lineage. It is near-real-time, not real-time. Its writes are not transactional. Its segments are immutable, so an update is really a delete-plus-reinsert with deferred cleanup. Its cluster state is a single replicated structure that an innocent-looking mapping can inflate until coordination grinds to a halt. Treat it as a search engine and it is genuinely superb. Treat it as a database of record and it will, eventually, hurt you.

This is the long-form context article — the thing you wish someone had handed you before your first Elasticsearch incident. It leans on Sharding & Partitioning for the distribution and quorum theory, PostgreSQL and DynamoDB for where the canonical copy of your data actually belongs, and Observability for the metrics that tell you a cluster is sick before users do. The appendix has a short refresher on inverted indexes and TF-IDF if you want the search fundamentals first.

If you remember one thing: in Elasticsearch the metric on fire is rarely the metric you were watching.

A motivating failure

A product team ships a feature that lets customers attach arbitrary key-value metadata to records — {"region": "emea", "tier": "gold"} — and they index those records into Elasticsearch so the metadata is searchable. Dynamic mapping is on, because it always is by default. For months it works perfectly.

Then a large customer integrates their own system and starts writing metadata where the keys are unique identifiers: {"order_8f21a": true, "order_9c03b": true, ...}. Every distinct key becomes a new field in the index mapping. Every new field is added to cluster state — the single source of truth about every index, shard, and mapping in the cluster — which is replicated to every node on every change.

Nothing in the dashboards looks wrong at first. CPU is moderate. Heap is fine. Query latency is normal. But cluster state has quietly grown from a few hundred kilobytes to forty megabytes, and now every mapping update has to serialize, replicate, and acknowledge that blob across the cluster. The pending-tasks queue on the master starts climbing. Index creation hangs. New nodes can’t join because they time out receiving cluster state. The whole cluster freezes — not from load, but from coordination overhead — while the CPU graphs everyone is staring at sit at 20%.

The on-call engineer reboots the master, which makes it worse: the new master has to publish that bloated state to everyone again. The fix was three commands — set index.mapping.total_fields.limit, switch the index to dynamic: strict, and reindex without the offending fields — but it took four hours to find, because the failure lived in a metric nobody had a dashboard for. That is the Elasticsearch failure pattern in one story: the thing that fell over was not the thing under load.

The one-sentence mental model

Elasticsearch shards your documents across nodes, builds an inverted index (term → list of documents) inside immutable Lucene segments on each shard, and makes new writes searchable only after a periodic refresh — trading real-time visibility and update-in-place for query speed a row store can’t touch.

Every clause is an operational constraint:

Inverted index → it is built to answer “which documents contain this term,” not “fetch this one row by id.” The latter works but is not what it’s for, and designing around id-lookups wastes its strengths.
Immutable segments → an update can’t edit a file in place. It marks the old document deleted and writes a new one; the space comes back only when segments merge.
Searchable only after refresh → a document you just indexed is invisible to search for up to a second by default. That gap is a throughput feature, and fighting it is how you wreck indexing performance.
Sharded across nodes → a query is a scatter-gather: fan out to every shard, gather and merge results. That shapes both latency and the cost of deep pagination.

flowchart LR
  D[Doc\nthe fast fox] --> A[Analyzer\ntokenize\nnormalize]
  A --> T1[term fast\ndoc 7, 9]
  A --> T2[term fox\ndoc 7]
  T1 --> S[(Lucene segment\nimmutable)]
  T2 --> S
  S --> Sh[(Shard\n= Lucene index)]
  Sh --> N[Node]

A shard is a self-contained Lucene index — the unit of distribution and the unit of recovery. An inverted index maps each term to the documents that contain it, which is why “find every doc mentioning timeout” is O(matching docs), not a full scan. A segment is an immutable file holding part of that index; a shard is a stack of segments plus a tiny live commit point that says which segments are current.

How it actually works

Analysis: where relevance is won or lost

Before a document is indexed, an analyzer turns its text into terms: it tokenizes on word boundaries, lowercases, drops stopwords, and stems (running → run, mice → mouse with the right analyzer). Those terms — not the original text — are what the inverted index stores. The same analyzer runs on query text, which is why searching Running matches a document that only ever contained runs.

Two consequences bite in production. First, the analyzer is part of your data model, fixed per-field in the mapping, and changing it requires reindexing — you cannot retroactively re-stem a billion documents in place. Second, a mismatched analyzer fails silently: relevance just quietly gets worse, no error, no alert. Teams discover months later that their keyword-typed field was never analyzed at all, so partial-word search never worked. Decide analyzers deliberately, and test relevance with real queries before you trust it.

Segments: write, refresh, flush, merge

Lucene segments are immutable, and that single fact shapes the entire write path:

A new document lands in an in-memory buffer and is appended to the per-shard translog (transaction log) for durability. At this point it is durable but not searchable.
On refresh (default index.refresh_interval = 1s), the buffer is written to a new segment and that segment becomes searchable. This is the near-real-time boundary.
On flush, the in-memory segments are committed to disk with Lucene fsync and the translog is truncated. This is the real durability checkpoint that survives a crash.
Every refresh creates a new segment, so segment count climbs. A background merge combines small segments into larger ones — and this is the only time deleted and updated documents are physically purged. An update marks the old doc deleted in its old segment and writes a fresh doc; the disk space is reclaimed at merge, not at update.

sequenceDiagram
  participant App
  participant Buf as Buffer
  participant TL as Translog
  participant Seg as Segments
  App->>Buf: index doc
  App->>TL: append (durable)
  Note over Buf,Seg: refresh every 1s
  Buf->>Seg: new segment (searchable)
  Note over TL,Seg: flush
  Seg->>Seg: fsync, truncate translog
  Note over Seg: background merge
  Seg->>Seg: combine segments, purge deletes

The biggest lever for bulk indexing falls straight out of this lifecycle: during a large load, raise index.refresh_interval to 30s or set it to -1 to disable refresh entirely. You stop minting a new segment every second, merges catch up, and indexing throughput can jump several-fold. Re-enable it when the load finishes. A second lever: set replicas to 0 during the initial bulk load and add them back afterward, so you index each document once instead of once per replica — but only when you can tolerate the no-redundancy window.

Shards and replicas

An index is split into primary shards and replica shards. The primary count is fixed at index creation — you cannot change it without reindexing into a new index — while the replica count is changeable at any time. A write goes to the primary first, then is replicated to each replica before the client is acknowledged at the default write semantics. Replicas serve reads and provide failover if a primary’s node dies.

The cardinal sizing mistake is over-sharding: hundreds of tiny shards because someone picked a big primary count “for headroom.” Each shard is a full Lucene index with its own memory overhead, file handles, and merge scheduling. Aim for shards in the 10–50 GB range, and keep total shard count proportional to heap — a long-standing rule of thumb is ≤ 20 shards per GB of heap. A 32 GB-heap data node should be carrying on the order of low hundreds of shards, not thousands.

Query routing: scatter, gather, fetch

A search is not one operation; it is a two-phase scatter-gather. A coordinating node receives the request, fans it out to one copy of every relevant shard (query phase), each shard returns just the top-N document ids and scores, the coordinator merges those into the global top-N, then fetches the full documents in a second round (fetch phase).

flowchart TD
  Q[Client query] --> C[Coordinating node]
  C --> S1[Shard 1\ntop N ids]
  C --> S2[Shard 2\ntop N ids]
  C --> S3[Shard 3\ntop N ids]
  S1 --> M[Merge\nglobal top N]
  S2 --> M
  S3 --> M
  M --> F[Fetch phase\nfull docs]
  F --> R[Response]

This is why deep pagination is poison. To return page 1000 with from: 10000, size: 10, every shard must produce its top 10010 documents so the coordinator can merge and discard 10000 of them — per shard, per request. The cost scales with from + size across all shards. Use search_after with a sort key (or a Point-In-Time / scroll for exports) instead, which carries a cursor and pages without re-sorting from the top.

Query context vs filter context

Every clause runs in one of two contexts, and mixing them up is one of the most common performance bugs.

Query context asks “how well does this match?” and computes a relevance _score. Use it for full-text relevance — match, multi_match, phrase queries.
Filter context asks “does this match, yes or no?” There is no scoring, and the result is cached as a bitset. Use it for exact, boolean, range, and term conditions: status = active, created > X, tenant_id = 42.

Putting an exact term filter inside must (query context) throws away the filter cache and burns CPU computing a meaningless score. Wrap every non-scoring condition in filter; reserve must for clauses where relevance ranking actually matters. On a high-QPS dashboard query, moving the boilerplate filters out of query context is frequently a 2–5× latency win for free.

The tradeoffs that bite

Decision	Looks free	Actually costs
`refresh_interval = 1s` always	Fresh results instantly	Segment churn, merge pressure, lower indexing throughput
Dynamic mapping on	Zero schema work	Mapping explosion — unbounded fields bloat cluster state
Many primaries “for headroom”	Future-proof scaling	Per-shard heap + file-handle overhead, slow cluster state
System of record	One fewer datastore	No transactions, soft durability, painful reindexing
Deep `from`/`size` paging	Simple to build	Every shard sorts `from+size` docs per request
Aggregations on text fields	Rich analytics	Field data loads into heap; a top cause of OOM

Mapping explosion earns its own paragraph because it caused the outage at the top of this article. With dynamic mapping on, every previously-unseen field name creates a mapping entry that lives in cluster state, replicated to every node on every change. Index documents with unbounded key names — user-supplied JSON used as field names, error payloads with embedded ids — and cluster state inflates into the megabytes, cluster-state updates slow to a crawl, and the whole cluster stalls on coordination even though CPU and heap look healthy. Defend with index.mapping.total_fields.limit (default 1000) and dynamic: strict on any index fed user-controlled JSON, so an unexpected field is rejected rather than silently absorbed.

Field data is the other quiet killer. Aggregating or sorting on an analyzed text field forces Elasticsearch to load an uninverted, in-memory structure (field data) onto the JVM heap. On a large field this can consume gigabytes and never be released. Use keyword fields (backed by on-disk doc_values) for aggregations and sorting, and leave text for relevance search only.

Query performance

Search is fast when the working set lives in the OS page cache and the query touches few segments with cheap clauses. The levers, in rough order of impact:

OS file cache, not heap. Lucene memory-maps segment files and relies on the operating system’s page cache to keep hot index data in RAM. This is why you cap JVM heap low and leave the rest of the box’s memory free — the file cache is doing the heavy lifting, and a too-large heap starves it.
Filter context + caching. Move every non-scoring clause into filter so the bitset cache absorbs repeated conditions (above).
Fewer, larger segments. A query visits every segment in a shard. A shard with 200 tiny segments is slower than one with 10 healthy ones; healthy merging keeps segment count down. force_merge to one segment is appropriate for read-only indices (yesterday’s logs), never for actively-written ones.
search_after over deep paging, and keyword/doc_values over field data for sorts and aggregations.
Replicas for read fan-out. Each replica is another copy a search can hit, so read throughput scales roughly with replica count — at the cost of indexing each doc into every replica.

Watch took, but trust per-shard timings more: the slow-query log and the profile API tell you whether time went to query execution, scoring, or the fetch phase. A query that’s fast on a small index and slow in production is almost always either deep pagination, an aggregation pulling field data, or a fan-out across far too many shards.

Indexing performance

Indexing throughput is governed by how often you refresh, how merges keep up, and whether you batch.

Bulk, always. Use the _bulk API with batches in the 5–15 MB range. Single-document indexing wastes a network round trip and a coordination hop per doc; bulk amortizes both.
Raise refresh_interval during loads (30s or -1), as covered above. This is the single biggest indexing lever.
Mind the merge. Merging is I/O- and CPU-heavy; on spinning disks or throttled cloud volumes it becomes the bottleneck, and segments pile up faster than they combine. Symptoms: segment count climbing without bound, indexing latency rising, merges.current staying high. Fast SSD/NVMe and sane indices.memory.index_buffer_size keep merges ahead.
Translog durability tradeoff. By default the translog fsyncs on every request (index.translog.durability = request), which is safe but costs an fsync per bulk. Setting it to async with a sync_interval batches those flushes for higher throughput at the cost of a small loss window on crash — a deliberate choice, identical in spirit to synchronous_commit in PostgreSQL.

The asymmetry to internalize: indexing is a write you can usually batch, defer, and retry, while a stalled merge or an over-eager refresh quietly taxes every write. Tune the cycle, not the individual write.

Failure modes

The infamous one is split-brain. Historically (pre-7.0), if discovery.zen.minimum_master_nodes was misconfigured, a network partition could let each side elect its own master. Both halves accept writes, the indices diverge, and on heal you have two conflicting versions of the truth with no clean merge. Elasticsearch 7.x replaced master election with a proper consensus protocol that computes quorum automatically, removing the foot-gun — but the lesson stands for anyone on older clusters, and the underlying quorum theory is worth understanding regardless (it’s in Sharding & Partitioning and Consistency & Consensus).

The second is the mapping explosion / cluster-state stall from the opening story. Symptoms: cluster-state size creeping up, the master’s pending-task queue climbing, cluster updates timing out — all while CPU and heap look fine. Root cause: unbounded field creation under dynamic mapping. Prevention: total_fields.limit and dynamic: strict on user-facing indices.

The third is the heap pressure GC death spiral, and it cascades viciously:

flowchart TD
  FD[Field data\n+ too many shards] --> H[Heap fills]
  H --> GC[Long GC pause]
  GC --> NR[Node unresponsive]
  NR --> DROP[Master drops node]
  DROP --> REL[Shards relocate]
  REL --> LOAD[Load on survivors]
  LOAD --> H

A node fills its heap with field data and shard overhead, hits a multi-second stop-the-world garbage collection, goes unresponsive, the master evicts it, its shards relocate to the remaining nodes, that relocation traffic pressures their heaps, and the failure rolls around the cluster. Cap heap at ≤ 31 GB (above it the JVM loses compressed object pointers and you get less usable heap), set min = max, keep field data off text fields, and keep shard count proportional to heap.

Elasticsearch is not your source of truth. Replicas are not backups, refresh is not a commit, and the consistency model was never designed to survive partitions the way a quorum database is. If losing the index would lose data you cannot regenerate, that’s an architecture bug — keep the canonical copy in a real database and treat the index as a derived, rebuildable projection.

A fourth, quieter mode: disk watermarks. When a data node crosses cluster.routing.allocation.disk.watermark.high (default 90%), Elasticsearch starts relocating shards off it; cross the flood_stage (95%) and indices on that node go read-only automatically, and they do not flip back without manual intervention even after you free space. A logging cluster that fills its disk overnight will reject writes the next morning with a cryptic read-only-allow-delete block. Alarm on disk well before the watermarks, and size retention so you never coast into them.

Scaling it

Elasticsearch scales horizontally well if you respect shard sizing and node roles. The work at 10× and 100× is mostly layout and dedicated roles, not raw node count.

Separate node roles. Dedicate an odd number of master-eligible nodes (3 is standard) that do no data work, so a query storm or a hot data node can never knock out the cluster’s coordination. Data nodes handle indexing and search; coordinating-only nodes absorb query fan-out and result merging without holding shards.
Time-based indices for logs and metrics — one index per day or per rollover threshold — managed by Index Lifecycle Management (ILM). This keeps each shard in the healthy band and makes deleting old data an index drop instead of a billion-document delete.
Hot–warm–cold–frozen tiers. ILM rolls an index over at a size or age threshold and migrates it down a storage hierarchy as it ages out of active use:

flowchart LR
  W[Writes] --> HOT[Hot tier\nfast NVMe]
  HOT -->|rollover| WARM[Warm tier\nread mostly]
  WARM -->|age| COLD[Cold tier\ncheap disk]
  COLD -->|age| FROZEN[Frozen tier\nobject storage]
  FROZEN --> DEL[Delete\nat retention]

Recent, hot data sits on expensive fast storage; older data drifts to cheaper tiers, and the frozen tier can keep searchable data on object storage at a fraction of the cost, fetching segments on demand. This is how you retain a year of logs without paying NVMe prices for all of it.

Scale reads with replicas, scale coordination with coordinating nodes. At high query load, add replicas (each adds a searchable copy) and front the fan-out with coordinating-only nodes — while keeping primary count stable to avoid reindexing.

The wall you hit is rarely total data volume. It’s cluster-state size (too many shards or fields) or heap pressure (field data, shard overhead). Both are layout problems you design away, not capacity problems you buy your way out of.

When to reach for it (and when not to)

Reach for Elasticsearch when you need full-text search with relevance ranking, fuzzy and typo-tolerant matching, faceted or aggregation-driven search, autocomplete, or log and metrics analytics at scale (the Elastic/ELK stack). It shines as a secondary, derived index kept in sync from your primary store via change data capture or a stream — the canonical write goes to your database, a pipeline projects it into the index.

Don’t reach for it as your system of record: its durability and consistency model are not built for that, and reindexing a primary store you can’t rebuild is a nightmare. Don’t use it for transactional workloads or anything that needs read-your-writes inside the refresh window. And don’t stand up a distributed cluster when a database’s built-in search already covers you — PostgreSQL tsvector/tsquery full-text search handles a surprising amount of “search” without a second system to operate, monitor, and keep in sync.

When to consider alternatives

Search your relational data already covers → PostgreSQL full-text (tsvector), trigram, and GIN indexes — no separate cluster to run.
Source of truth and transactions → PostgreSQL or DynamoDB; project into the index, never from it.
The change stream feeding the index → Kafka for CDC and reliable fan-out into Elasticsearch.
Sub-millisecond key lookups and counters → Redis, not a scatter-gather search.
Cheap long-term retention of cold data → tier to object storage via ILM frozen tier rather than keeping it on hot nodes.
Distribution/quorum fundamentals behind all of this → Sharding & Partitioning and Consistency & Consensus.

Operational checklist

Run an odd number of dedicated master-eligible nodes (3) that hold no data, so coordination survives a data-node meltdown.
Cap JVM heap at ≤ 31 GB, set min = max, and leave the rest of RAM free for the OS file cache Lucene depends on.
Keep shards in the 10–50 GB range; alarm on total shard count relative to heap (~20 shards/GB).
Set index.mapping.total_fields.limit and dynamic: strict on any index fed user-controlled JSON.
Raise index.refresh_interval (30s or -1) and drop replicas to 0 during bulk loads; restore both after.
Use filter context for all non-scoring clauses; reserve must/query context for relevance ranking.
Use search_after (or PIT/scroll) instead of deep from/size pagination.
Aggregate and sort on keyword/doc_values fields, never on analyzed text (field data → OOM).
Alarm on disk usage below the 85%/90%/95% watermarks; a flood-stage index goes read-only and won’t auto-recover.
Treat the index as derived: keep the canonical copy elsewhere and rehearse the reindex/rebuild path before you need it.

Summary

Elasticsearch is the right tool for search and analytics and the wrong tool for being your database. Almost all of its sharp edges come from a handful of design facts: it builds immutable inverted-index segments (so updates defer cleanup to merge time and refresh gates visibility), it fans queries out scatter-gather across shards (so deep pagination and over-sharding are expensive), and it coordinates the whole cluster through a single replicated cluster state that an unbounded mapping can bloat into a stall. Cap your heap, leave RAM for the file cache, keep shards in a healthy size band, lock down dynamic mapping, filter instead of scoring where you can, and keep the source of truth in a real database. Do that and Elasticsearch is fast, durable enough for derived data, and boring — which is exactly what you want from infrastructure.

Appendix: inverted indexes & relevance

If the body assumed search fundamentals you’d like restated:

Inverted index — instead of mapping documents to their words (a forward index), it maps each term to the list of documents that contain it (a postings list). “Find every doc with timeout” becomes a single lookup plus a list walk, not a scan of every document.
Tokenization & normalization — splitting text into terms and folding case, accents, and word forms (stemming) so Mice, mouse, and mice collapse to a common term that matches at query time.
TF-IDF / BM25 — relevance scoring weighs how often a term appears in a document (term frequency) against how rare it is across the whole corpus (inverse document frequency), so a match on a rare word counts for more than a match on a common one. BM25 (the modern default) adds saturation and length normalization so long documents and repeated words don’t dominate.
Near-real-time — a document is durable (in the translog) before it is searchable (after refresh). The gap between those two events is the property that lets Elasticsearch sustain high indexing throughput, and the reason it can’t promise read-your-writes within that window.

Incidents & deep-dives

Where this system breaks in production — and how it comes back.

Documenting next

🔒 Split-Brain: Two Masters, One Clusterroadmap →