PostgreSQL

A deep, production-grade tour of PostgreSQL: MVCC and bloat, VACUUM and the wraparound cliff, the planner, WAL and checkpoints, the process-per-connection model, replication, isolation, and scaling.

17 min readupdated 2026-06-28

On this page

PostgreSQL has a reputation as the database that “just works.” For a single node under modest load, it earns it. The trouble starts when teams carry that assumption into production at scale and discover that Postgres makes one specific, load-bearing tradeoff that touches everything downstream: an UPDATE does not overwrite a row. It writes a new version and leaves the old one behind.

Internalize that single sentence and most Postgres “mysteries” dissolve. Table bloat, why VACUUM exists, why one forgotten transaction can wreck performance on tables it never touched, why a connection count that looks harmless melts the server, why a query that ran in 2ms for a year suddenly takes 8 seconds — they are all downstream of how Postgres stores data and reclaims it.

This is the long-form context article. It is meant to be the thing you wish someone had handed you before your first Postgres incident. It leans on Database Replication for the read-scaling story, Database Indexing for the index deep-dive, and Caching Strategies for what you put in front of it. The appendix at the bottom is a quick ACID/SQL refresher if you want the fundamentals first.

A motivating failure

A payments team runs Postgres as their system of record. For eighteen months it is boring in the best way: p99 query latency around 3ms, CPU under 30%, nobody thinks about the database.

Then a new analytics job ships. It opens a transaction, runs a long report, and — because of an ORM default nobody noticed — leaves the connection idle in transaction for forty minutes while it streams results to a slow downstream. Nothing errors. But during those forty minutes, VACUUM cannot reclaim a single dead row anywhere in the database, because one open snapshot might still need to see old versions.

Writes continue. The hot payments table churns. Dead tuples pile up behind the held snapshot. By the time the report finishes, the table has tripled in physical size, every index scan reads three times the pages it used to, the buffer cache thrashes, and p99 has climbed from 3ms to 400ms. The on-call engineer sees “database slow,” restarts the primary in a panic, the connection pool stampedes back all at once, and a latency problem becomes a full outage.

Nothing in that story is a bug. Every component did exactly what it was configured to do. The outage lived entirely in understanding — in not knowing that a single idle transaction is a database-wide brake. That is the kind of failure this article exists to prevent.

The one-sentence mental model

PostgreSQL gives every transaction a stable, consistent snapshot by keeping multiple physical versions of each row (MVCC), tagging each version with the transaction IDs that created and ended it — so readers never block writers and writers never block readers — at the cost of accumulating dead versions that a background process must continuously reclaim.

Every clause is an operational constraint:

Multiple physical versions → storage grows with churn, not just with data. A 100 MB table under heavy update can occupy 1 GB on disk.
Tagged with transaction IDs → those IDs are a finite 32-bit counter that can wrap around, and preventing that is part of VACUUM’s job, not an optional chore.
Readers never block writers → the headline feature, and the reason Postgres feels so smooth until it doesn’t.
A background process must reclaim them → autovacuum is part of the correctness path, not housekeeping you can defer.

flowchart LR
  U[UPDATE row id=7] --> N[write NEW tuple\nxmin = current txid]
  U --> O[mark OLD tuple\nxmax = current txid]
  O --> D[old tuple is dead\nstill on disk]
  D --> V[autovacuum reclaims\nspace for reuse]
  N --> H[heap page\nmay split = bloat]

A tuple is one physical version of a row. Each carries xmin (the transaction that created it) and xmax (the transaction that superseded or deleted it). A transaction sees a tuple only if xmin is committed and visible to its snapshot and xmax is not. That visibility check, run per tuple, is the whole trick.

How storage actually works

Pages, tuples, and TOAST

Postgres stores tables as an array of fixed-size 8 KB pages (the “heap”). Each page holds a handful of tuples plus a line pointer array. A row never spans pages directly; instead, large field values (big text, bytea, jsonb) are compressed and pushed out-of-line into a TOAST table, with a pointer left in the main tuple.

Two practical consequences fall out of this layout. First, very wide rows waste space and I/O because even reading one column drags the whole tuple (and possibly a TOAST fetch) into cache. Second, HOT updates (Heap-Only Tuples) are an important optimization: if an update doesn’t touch any indexed column and the new tuple fits on the same page, Postgres can chain it without touching every index — far cheaper. Designing tables so hot-path updates stay HOT-eligible is a real, measurable win on write-heavy tables.

MVCC: snapshots without locks

When you UPDATE, Postgres sets xmax on the current tuple (marking it dead to future transactions) and writes a fresh tuple with a new xmin. DELETE only sets xmax. The old versions remain on disk, visible to any older snapshot that still needs them, until VACUUM proves no one does.

sequenceDiagram
  participant T1 as Txn A (reader)
  participant DB as Heap
  participant T2 as Txn B (writer)
  T1->>DB: SELECT row 7 (snapshot @ txid 100)
  T2->>DB: UPDATE row 7 (txid 105)
  Note over DB: old tuple xmax=105\nnew tuple xmin=105
  T1->>DB: SELECT row 7 again
  Note over T1: still sees OLD version\nsnapshot is stable
  T2->>DB: COMMIT
  Note over DB: new version visible\nto txns after 105 only

This is why long-running readers are dangerous: the snapshot’s horizon pins every tuple newer than it as potentially-needed, database-wide. VACUUM respects the oldest live snapshot, so one slow transaction defers cleanup everywhere, not just on the tables it queried.

The cost: bloat

If dead tuples accumulate faster than they’re reclaimed, you get bloat — a table or index that is mostly dead space. Symptoms: physical size grows while live row count is flat, sequential scans slow down, and the buffer cache fills with garbage pages. The two classic causes are (1) update-heavy tables where autovacuum is tuned too conservatively, and (2) a long-lived or idle in transaction session holding back the cleanup horizon.

You can measure it directly:

SELECT relname, n_live_tup, n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup,0), 3) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;

A dead_ratio consistently above ~0.2 on a large table means vacuum isn’t keeping up. The fix is either more aggressive autovacuum (below) or, for tables already bloated, an online repack with pg_repack.

VACUUM, autovacuum, and the wraparound cliff

autovacuum runs VACUUM automatically once a table’s dead-tuple count crosses a threshold — by default autovacuum_vacuum_scale_factor = 0.2, i.e. 20% of the table must churn first. On a 50-million-row hot table that’s ten million dead tuples before cleanup even starts, which is far too late. Per-table tuning is one of the highest-impact knobs in Postgres:

-- per-table, for a hot 50M-row table
ALTER TABLE payments SET (
  autovacuum_vacuum_scale_factor = 0.02,  -- vacuum at 2% churn, not 20%
  autovacuum_vacuum_cost_limit   = 2000   -- let the worker do more per round
);

VACUUM has a second job that is not negotiable: freezing old tuples to prevent transaction-ID wraparound. Transaction IDs are a 32-bit counter (~4.2 billion values) compared in modular arithmetic. To keep ancient tuples visible “forever,” Postgres marks them frozen once they age past a threshold. If autovacuum falls behind and a database’s oldest unfrozen transaction approaches ~2 billion in age, Postgres protects itself from silent corruption by refusing new writes and dropping into a single-user recovery mode.

flowchart TD
  W[high write rate\ntxid climbing] --> A{autovacuum\nfreezing keeping up?}
  A -->|yes| OK[healthy\nage stays bounded]
  A -->|no| C[age datfrozenxid\nclimbs toward 2B]
  C --> WARN[wraparound warnings]
  WARN --> RO[writes refused\nread-only mode]
  RO --> FIX[run VACUUM FREEZE\nto recover]
  style RO fill:#e11d48,color:#fff
  style C fill:#171717,color:#fff

Watch it continuously and alarm early:

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database ORDER BY xid_age DESC;

Page well before 1.5 billion. Wraparound is rare but catastrophic, and it essentially always means autovacuum was disabled “temporarily” months ago, or a table was configured to skip it. The recovery — a manual VACUUM (FREEZE) while writes are blocked — can take hours on a large table, which is exactly when you can least afford downtime.

Most teams use exactly one index type and wonder why some queries are still slow. Postgres ships a whole menu; picking the right one changes cost by orders of magnitude. The full mechanics live in Database Indexing; here is the operator’s summary.

Index	Best for	Watch out for
B-tree (default)	Equality + range on ordered, high-cardinality columns	Useless for “contains” / array / jsonb queries
GIN	Array membership, `jsonb @>`, full-text search	Slower writes; larger; tune `fastupdate`
GiST	Geometric / range / nearest-neighbor (PostGIS)	More specialized; lossy for some types
BRIN	Naturally ordered, append-only data (time-series `created_at`)	Useless if physical order is random
Partial	Indexing only the rows you query (`WHERE active`)	Predicate must match query
Covering (`INCLUDE`)	Index-only scans that avoid the heap	Wider index, more maintenance

The BRIN case is the most dramatic. On a 1 TB append-only time-series table, a B-tree on created_at might be 30 GB and compete with the heap for cache; the BRIN equivalent is a few megabytes because it stores only the min/max per block range. The catch is it only works if rows are physically clustered by that column — which append-only inserts give you for free.

Two rules that prevent most index pain: composite index column order matters (the leftmost-prefix rule — an index on (a, b) helps WHERE a= and WHERE a= AND b=, not WHERE b= alone), and every index taxes every write, so an unused index is pure cost. Find dead weight with pg_stat_user_indexes (idx_scan = 0 over a long window).

The planner and EXPLAIN

Postgres is cost-based: the planner estimates the cheapest plan from statistics gathered by ANALYZE (autovacuum runs it too). When stats are stale, estimates drift, and the planner can flip from an index scan to a sequential scan — a 2ms query becomes an 8s query with no code change.

EXPLAIN (ANALYZE, BUFFERS) is the tool, and the single most important thing to read is the gap between estimated and actual rows:

Nested Loop  (cost=... rows=1 ...) (actual ... rows=2000000 ...)

An estimate of rows=1 that actually returns two million means the planner was lied to by stale statistics and chose a nested loop where it should have chosen a hash join. The fix is usually fresh ANALYZE, sometimes a higher default_statistics_target on a skewed column, occasionally extended statistics (CREATE STATISTICS) for correlated columns.

Read performance

Reads are fast when the working set fits in cache and the planner picks the right access path. The levers, in rough order of impact:

shared_buffers — the database’s own cache (start around 25% of RAM). Below it, the OS page cache also holds data, so Postgres benefits from both.
The right index — an index-only scan that never touches the heap is the gold standard; design covering indexes for hot read paths.
Connection discipline — every active query competes for CPU and buffers; fewer, pooled connections often raise throughput (see below).
Replicas for read fan-out — once a single node’s reads saturate, route read-only traffic to streaming replicas, accepting replication lag for read-after-write.

Write performance: WAL, checkpoints, fsync

Every change is written to the Write-Ahead Log (WAL) before the data pages themselves. The WAL is the durability contract: on crash, Postgres replays it to recover committed work. It is also what streaming replicas consume.

flowchart LR
  C[COMMIT] --> WAL[append to WAL\nfsync to disk]
  WAL --> ACK[client acked durable]
  WAL --> BUF[dirty pages in\nshared_buffers]
  BUF --> CKPT[checkpoint flushes\ndirty pages to heap]
  WAL --> REP[stream to replicas]

Two write-path realities bite in production. Checkpoints periodically flush all dirty buffers to the heap; if they bunch up, you get a “checkpoint storm” — an I/O spike that stalls queries. Spread them out by raising max_wal_size and checkpoint_completion_target. And fsync honesty matters: synchronous_commit = off makes commits faster by not waiting for WAL flush, at the cost of losing the last fraction of a second of transactions on crash — acceptable for some workloads, a silent data-loss footgun for others. Decide deliberately.

The connection model (and why PgBouncer is mandatory)

Here is the surprise that takes down more Postgres deployments than anything else: Postgres uses one OS process per connection, not a thread pool. Each backend costs 5–10 MB of memory and scheduler attention, and shared structures like the lock table contend as connections grow. Past a few hundred active connections, throughput drops — more connections make the server slower, not faster.

The fix is a pooler. PgBouncer in transaction mode multiplexes thousands of client connections onto a small server pool (often 20–50), lending a server connection to a client only for the duration of a transaction:

[pgbouncer]
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 25

The caveat: transaction mode breaks session-scoped features — session SET, advisory locks, WITH HOLD cursors, and (unless configured) server-side prepared statements. Know what your app relies on before flipping the mode. Cap max_connections on the server low (100–200) so the pooler, not the database, absorbs surges.

Replication & high availability

Replication is via WAL streaming: the primary ships its WAL to replicas, which replay it. By default this is asynchronous, so replicas lag — a read against a replica immediately after a write to the primary can return stale data (the read-your-writes problem). synchronous_standby_names makes a commit wait for a replica to acknowledge, trading write latency for a tighter durability/RPO guarantee.

Failover is not automatic in core Postgres; you run a tool (Patroni, repmgr, or a managed service) that detects primary failure, promotes a replica, and — critically — fences the old primary so two nodes don’t both accept writes (split-brain). The full topology discussion is in Database Replication; the one-line takeaway is that your failover loss window equals your replication lag, so measure it before you need it.

Consistency & isolation levels

Postgres defaults to READ COMMITTED: each statement sees a fresh snapshot, so two statements in one transaction can observe different data. REPEATABLE READ pins one snapshot for the whole transaction (and, in Postgres specifically, also rules out many anomalies weaker databases permit here). SERIALIZABLE uses predicate locking (SSI) to guarantee transactions behave as if executed one at a time — but it can abort a transaction with a serialization_failure that your application must catch and retry.

The trap: choosing SERIALIZABLE for safety without adding retry logic trades subtle correctness bugs for loud availability bugs. Pick the weakest level that is actually correct for the operation, and wrap serializable transactions in a retry loop. For deeper background on the underlying tradeoff, see Consistency & Consensus and CAP Theorem & Tradeoffs.

Scaling: vertical → replicas → partitioning → sharding

Vertical first. A single node with fast NVMe and enough RAM goes much further than people expect. Right-size shared_buffers, fix vacuum, kill bad queries before you reach for distribution.
Read replicas for read fan-out (accepting lag).
Declarative partitioning when one table grows unwieldy: split it by range or list into child tables so each table’s indexes and vacuum cost stay manageable. Partition pruning then skips irrelevant children at plan time.
Sharding across multiple clusters (application routing or Citus) once a single primary’s write ceiling is the wall. At that point you have left “just Postgres” and signed up for a distributed system’s sharding and consistency tradeoffs.

Failure modes

Connection storm. A spike or pool reset sends every app instance reconnecting at once; processes pile up, memory and CPU saturate, latency climbs, the app opens more connections to compensate, and it spirals into FATAL: sorry, too many clients already. Prevention: PgBouncer + low max_connections.
Vacuum debt → wraparound. Autovacuum throttled or disabled on a high-churn table; bloat grows, then txid age climbs toward the read-only cliff. Prevention: aggressive per-table autovacuum + age(datfrozenxid) alarms.
Planner flip. Stale stats after a bulk load or distribution shift turn a fast query slow. Prevention: ANALYZE after big changes; raise statistics targets on skewed columns.
Checkpoint storm. Bunched checkpoints spike I/O and stall queries. Prevention: larger max_wal_size, checkpoint_completion_target near 0.9.
Idle-in-transaction brake. One held snapshot blocks vacuum globally (the opening story). Prevention: idle_in_transaction_session_timeout and alarms on long transactions.

Almost every serious Postgres incident is one of two stories: connections treated as free when each is a process, or vacuum treated as optional when it is part of correctness. Fix those two beliefs and you have prevented most outages before they happen.

When to reach for it (and when not to)

Reach for PostgreSQL as your default system of record. Strong consistency, real transactions, rich SQL, jsonb for semi-structured data, mature replication, and an enormous extension ecosystem (PostGIS for geo, pgvector for embeddings, full-text search built in) cover the overwhelming majority of backends. If you are unsure which database to pick, the honest answer is usually Postgres.

Don’t reach for it when you need single-digit-millisecond key-value access at internet scale with near-zero ops — that is a managed key-value store’s job (DynamoDB). Don’t make it your primary message bus (Kafka and dedicated brokers do that better), and don’t run heavy analytical scans against your transactional primary — replicate into a columnar warehouse instead.

When to consider alternatives

Massive write throughput, linear scale-out, tunable consistency → Cassandra or DynamoDB.
Search and relevance ranking → Elasticsearch (as a secondary index, not the source of truth).
Sub-millisecond ephemeral state, counters, rate limits → Redis.
Event log / streaming backbone → Kafka.
Horizontal Postgres specifically → Citus, but accept the distributed-systems tax.

Operational checklist

Put PgBouncer (transaction mode) in front of any multi-instance app; cap max_connections at 100–200.
Alarm on age(datfrozenxid) per database; page well before 1.5 billion.
Tune autovacuum_vacuum_scale_factor down (0.01–0.05) on large hot tables; never disable autovacuum globally.
Set idle_in_transaction_session_timeout; alarm on transactions open longer than a few minutes.
Monitor n_dead_tup and bloat; pg_repack tables autovacuum can’t keep up with.
Run ANALYZE after bulk loads; raise default_statistics_target on skewed columns; review with EXPLAIN (ANALYZE, BUFFERS).
Spread checkpoints (max_wal_size, checkpoint_completion_target); watch checkpoint I/O.
Track replication lag (pg_stat_replication); route read-after-write to the primary.
Drop unused indexes (idx_scan = 0); every index taxes every write.

Summary

PostgreSQL is the best default database for most systems, and almost all of its sharp edges trace back to two design facts: it keeps old row versions (so you must vacuum) and it uses one process per connection (so you must pool). Learn MVCC, keep autovacuum ahead of churn, watch the wraparound counter, put a pooler in front, read your EXPLAIN output, and scale vertically before you distribute. Do that and Postgres will be boring — which, for a system of record, is the highest compliment there is.

Appendix: ACID & SQL refresher

If the body assumed fundamentals you’d like restated:

Atomicity — a transaction is all-or-nothing; a failure rolls back every change.
Consistency — every committed transaction moves the database from one valid state to another (constraints, foreign keys, triggers hold).
Isolation — concurrent transactions don’t see each other’s half-finished work; the isolation level sets how strictly (see above).
Durability — once committed, data survives a crash, guaranteed here by the WAL + fsync.

A relational database models data as tables (relations) of rows (tuples) and columns (attributes), with keys (primary/foreign) expressing identity and relationships. SQL is the declarative language for it: you describe what you want (SELECT … WHERE … JOIN … GROUP BY) and the planner decides how to get it. That separation of “what” from “how” is exactly why the planner — and the statistics that feed it — matters so much in practice.

Incidents & deep-dives

Where this system breaks in production — and how it comes back.

Documenting next

🔒 Connection Storm: When PgBouncer Saves Youroadmap →
🔒 Transaction ID Wraparound: The Clock Runs Outroadmap →