Celery

Celery in production: a distributed task queue with at-least-once delivery and an invisible backlog. Brokers, acks_late, prefetch, visibility timeouts, and the backpressure you can't see.

19 min readupdated 2026-06-28

On this page

Celery gets introduced as two lines of code: @app.task over a function and .delay() at the call site. That framing survives exactly until the first time a queue silently grows to four million messages while every dashboard you own stays green. Celery is a distributed task queue with at-least-once delivery and almost nothing else guaranteed. Every word in that phrase is a decision you have to make on purpose, because if you don’t, the defaults will make it for you — and they optimize for a five-task demo, not for the night your downstream API starts timing out.

The thing Celery does not give you is a queue you can see. The broker holds your backlog, the workers drain it, and the gap between “tasks produced” and “tasks consumed” is invisible unless you go looking for it. That gap is where most Celery outages live, and it is the part no tutorial mentions.

This is the baseline context article — enough to reason about Celery under real load, not just under a happy-path test. It leans on Redis (the most common broker and the natural home for idempotency keys), on Message Queues for the general delivery-semantics theory, and on Observability for the one metric that actually matters here. If you want the queueing fundamentals restated, there’s an appendix at the bottom.

A motivating failure

A mid-size SaaS sends transactional email — password resets, receipts, export-ready notifications — through Celery on a Redis broker. For a year it is invisible infrastructure. Then marketing ships a feature: every signup now triggers a welcome sequence, and a nightly job enqueues a “we miss you” email for dormant users. Volume goes from ~5k tasks/hour to ~200k.

Nothing breaks loudly. But the email provider has a per-second send cap, and the workers — eight processes, default worker_prefetch_multiplier=4 — start spending most of their time blocked on the provider’s rate-limit responses. Throughput drops below the enqueue rate. The Redis list backing the queue starts to grow: 10k, then 80k, then 600k messages. CPU on the workers sits at 15%, so the host dashboards are green. The autoscaler, which scales on CPU, does nothing.

Support tickets arrive before any alert does: “I reset my password an hour ago and never got the email.” On-call looks at worker CPU (fine), at Redis memory (climbing, but not yet critical), and finally — three hours in — runs redis-cli LLEN celery and sees 1,900,000. The backlog is now so deep that even at full send rate it will take six hours to drain, and half those welcome emails are for sessions that already expired.

Nothing in that story is a bug. The workers worked. Redis worked. The provider worked. The outage lived entirely in the gap nobody graphed: tasks produced minus tasks consumed, integrated over time. That number was the outage, and it was invisible the whole way down.

The one-sentence mental model

Celery serializes a function call into a message, pushes it onto a broker, and trusts that some worker will eventually pull it off and run it — at least once, in no guaranteed order, with a result you usually shouldn’t wait on.

Unpack each clause, because each is an operational constraint:

Serializes a function call into a message → args must be serializable (JSON by default since Celery 4); you pass an order_id, never an ORM object.
Pushes it onto a broker → the broker is now a stateful system you have to size, persist, and monitor. Your backlog lives there, not in your app.
Some worker will eventually pull it → the producer never talks to the worker. Decoupling is the whole point and the whole danger: backpressure is silent.
At least once → your task will run twice eventually. Idempotency is not optional; it’s the contract.
No guaranteed order → redelivery and multiple queues reorder work. Don’t build sequencing on top of it.
A result you shouldn’t wait on → blocking on .get() in a web request reintroduces the coupling you paid Celery to remove.

flowchart LR
  P[Producer\ntask.delay] --> B[(Broker\nRedis / RabbitMQ)]
  B --> W1[Worker 1\nN child procs]
  B --> W2[Worker 2]
  W1 --> R[(Result backend\noptional)]
  W2 --> R
  P -. poll AsyncResult .-> R

Two stores, two jobs, and conflating them is the first mistake. The broker is the work queue — it holds messages waiting to run, and it must be durable enough not to lose pending work. The result backend is a separate key-value store that holds return values and task state; it is throwaway, and you should treat it as a cache with a TTL. They have different lifetimes, different durability needs, and frequently should be different systems entirely.

How it actually works

A .delay() call runs nothing. It encodes the task name, args, and kwargs into a message and hands it to the broker, then returns an AsyncResult immediately. The function body executes later, on some worker, maybe. Internalizing that “maybe” is the difference between using Celery and being surprised by it.

Broker vs result backend

Pick the broker for its delivery semantics, not for what you already run.

Broker	Delivery model	Durability	The catch
RabbitMQ	Native AMQP queues, real acks	Persistent queues survive restart	Heavier to operate; the “correct” choice Celery was built around
Redis	List push/pop emulating a queue	Only as durable as your Redis persistence	Visibility-timeout redelivery is fragile; easy to start, easy to lose work
SQS	Managed, long-polled	Durable, at-least-once	No native priorities, no remote-control commands, polling latency

The result backend is independent of the broker and frequently misused. Using Redis as a result backend at thousands of tasks/sec creates one key per task that lingers until its TTL — a silent memory leak if result_expires isn’t set. The honest default for most systems is ignore_result=True: the overwhelming majority of tasks are fire-and-forget and have no return value anyone reads.

app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_expires=3600,        # results die after an hour, no leak
    task_ignore_result=True,    # opt in per-task where you actually need it
)

Acks, visibility timeout, and redelivery

This is the part people get wrong, and it decides whether you lose work or duplicate it. By default, Celery acknowledges a message before the task runs (task_acks_late=False). The worker pulls the message, immediately tells the broker “got it, delete it,” then executes. If that worker is SIGKILLed or the box dies mid-task, the work is gone — acked and deleted, never completed, no trace.

Setting task_acks_late=True flips the order: the message is acked only after the task returns. A crash mid-task now leaves the message on the broker, and it gets redelivered. This is correct for any task you can’t afford to lose — but it only makes sense paired with idempotent tasks, because redelivery means double execution by design.

sequenceDiagram
  participant P as Producer
  participant B as Broker
  participant W as Worker
  P->>B: publish task (acks_late=True)
  B->>W: deliver, start visibility timer
  Note over W: task runs 90s
  W--xW: worker SIGKILLed at 60s
  Note over B: timer expires, redeliver
  B->>W: redeliver to another worker
  Note over W: runs again, idempotency saves you

With the Redis broker there are no real AMQP acks, so Celery emulates them with a visibility timeout (visibility_timeout, default 3600 seconds). When a worker takes a message, Redis hides it for that window. If the worker doesn’t ack within the window, Redis assumes it died and redelivers. The trap is brutal in its simplicity: if a legitimate task runs longer than visibility_timeout, Redis redelivers it to a second worker while the first is still running it. The same task now executes concurrently, twice — not as an edge case, but as steady-state behavior. Long-running tasks on a Redis broker must have a visibility timeout safely longer than the slowest task, or you’ve built duplicate execution into the foundation.

Idempotency is not optional

At-least-once means every task must tolerate running more than once with the same input. The pattern is a stable dedup key derived from the arguments, plus a check-and-set before the side effect:

@app.task(bind=True, acks_late=True, max_retries=5)
def charge_order(self, order_id):
    # SET NX returns None if the key already exists
    if redis.set(f"charged:{order_id}", "1", nx=True, ex=86400) is None:
        return  # redelivery of work already done — no-op
    payments.charge(order_id)

Without that guard, acks_late plus a worker restart equals double charges. The delivery semantics are the application’s contract to honor, not something a Celery flag will fix for you. This is the same discipline you’d apply consuming from Kafka or SQS — at-least-once systems push exactly-once up into your code.

Prefetch multiplier

Each worker process fetches worker_prefetch_multiplier × concurrency messages up front and holds them in memory before running any. The default is 4. For short, uniform tasks this is a throughput win — no broker round trip per task. For long or uneven tasks it’s a disaster: one worker grabs a batch of slow tasks, reserves the rest unstarted, and idle workers can’t pull them because they’re already spoken for. You see workers pinned at 100% next to workers doing nothing, with a queue that won’t drain.

flowchart TD
  Q[Queue\nmixed durations] --> PF{prefetch = 4}
  PF --> W1[Worker A\n1 slow + 3 reserved]
  PF --> W2[Worker B\nidle, nothing left]
  W1 --> S[3 tasks wait\nbehind slow one]
  W2 --> I[capacity wasted]
  S --> FIX[set prefetch = 1\nbroker spreads work]
  I --> FIX

The fix for long or uneven tasks is worker_prefetch_multiplier=1 with task_acks_late=True: a worker holds exactly one message at a time and the broker distributes work evenly. You give up a little latency on tiny tasks to stop slow tasks from starving the pool.

The tradeoffs that bite

These are the decisions that look free in the tutorial and aren’t.

Decision	Looks free	Actually costs
`acks_late=False` (default)	Faster, simpler	Silent task loss on worker crash
`acks_late=True`	No lost work	Double execution — needs idempotency
`prefetch_multiplier=4` (default)	Higher throughput	Long tasks starve idle workers
Redis broker	One-command setup	Visibility-timeout redelivery on slow tasks
Result backend on	”I can check status”	Per-task key, memory leak without `result_expires`
`task_time_limit` unset	Tasks never killed	One hung task pins a worker forever
Big serialized args	Convenient	Broker bloat; pass IDs, not objects

The recurring theme: Celery’s defaults optimize for the demo — fast, simple, no data loss visible in a five-task test — and against production, where durability, even work distribution, and bounded resource use are the whole game. You override the defaults deliberately, or you inherit the matching failure mode.

Throughput and latency: where the time goes

Celery has two performance numbers that people conflate and shouldn’t: task latency (how long one task takes end to end, including time waiting in the queue) and drain rate (how many tasks/sec the pool clears). A backlog is a drain-rate problem; a slow individual task is a latency problem; and they call for opposite fixes.

The levers, in rough order of impact:

Pool type matched to the work. The default prefork pool runs CPU-bound tasks on separate processes — right when tasks burn CPU, wrong when they sit on network I/O. For I/O-bound tasks (calling APIs, waiting on a database), the gevent or eventlet pools let one process juggle hundreds of concurrent tasks while they wait. A worker doing --pool=gevent --concurrency=200 can hold 200 in-flight HTTP calls on one core; the same workload on prefork would need 200 processes.
Concurrency sized to the bottleneck. CPU-bound: --concurrency ≈ core count. I/O-bound: far higher, because the work is waiting, not computing. Over-provisioning prefork concurrency on a CPU-bound queue just adds context-switching overhead and memory pressure.
Prefetch tuned to task shape (above). The single biggest cause of “we have spare workers but the queue won’t drain.”
Result backend off where you don’t read results. Writing a result for every task at high throughput adds a broker/backend round trip and the memory cost of a key. task_ignore_result=True removes both.
Batching at the edges. Ten thousand send_email.delay() calls in a loop is ten thousand broker round trips from the producer. If the producer is on the request path, that’s real latency. Enqueue in groups, or hand the loop itself to a single task.

A concrete baseline: a Redis-brokered Celery pool doing trivial tasks (increment a counter) clears tens of thousands of tasks/sec across a handful of workers. The same pool doing tasks that each make a 200ms external call clears concurrency / 0.2 tasks/sec per worker on a gevent pool — so 200 concurrency gives ~1000 tasks/sec per worker, bounded entirely by the downstream, not by Celery. When the downstream is the limit, adding Celery workers does nothing but deepen the queue faster; the fix is rate limiting at the worker and backpressure at the producer.

Failure modes

Celery breaks in a small number of recognizable ways. Learn the symptoms and you’ll diagnose in minutes instead of hours.

The invisible backlog. Symptom: users report stale or missing results (“my export from an hour ago is still pending”) while every infra dashboard is green. Root cause: producers enqueue faster than workers drain — a deploy cut concurrency, tasks got slower, or a downstream started timing out — and the only place the deficit shows up is broker queue depth, which almost nobody graphs. Prevention: alert on queue depth and its slope, not worker CPU (see the blockquote). This is the signature Celery outage and the motivating story above.

If you monitor one thing about Celery, monitor broker queue depth over time, not worker CPU. A healthy system has queue depth oscillating near zero. A depth with a positive slope that never returns to baseline is an outage in progress, and it is silent. Alert on depth > threshold and on depth increasing for N minutes — the second catches the slow bleed the first misses.

Duplicate execution. Symptom: side effects happen twice — double emails, double charges, duplicate rows. Root cause: visibility timeout shorter than the slowest task on a Redis broker, or acks_late=True paired with non-idempotent tasks. Prevention: visibility timeout longer than your slowest task, and a dedup key on every task with an external side effect.

Worker starvation from prefetch. Symptom: some workers at 100%, others idle, queue not draining despite apparent spare capacity. Root cause: long tasks plus default prefetch_multiplier=4 reserving messages that idle workers can’t touch. Prevention: worker_prefetch_multiplier=1 for long/uneven queues.

The hung-task worker leak. Symptom: effective concurrency silently drops to near zero while the process count looks healthy. Root cause: a task blocks forever on a socket read with no timeout; with no task_time_limit, that worker process never returns to pull more work, and enough of them strand the whole pool. Prevention: always set task_time_limit and task_soft_time_limit, and put timeouts on every network call inside tasks.

Poison messages. Symptom: a tight crash-redeliver loop burning a worker, queue never fully drains. Root cause: a task that crashes deterministically on a bad payload, with acks_late=True, gets redelivered forever. Prevention: bound max_retries, and route exhausted messages to a dead-letter queue instead of letting them recirculate.

Result backend exhaustion. Symptom: Redis (or whatever backs results) climbs to eviction or OOM. Root cause: result backend on, result_expires unset, high throughput — the per-task keys never expire. Prevention: set result_expires, or ignore_result=True for tasks whose return value nobody reads.

Scaling it

At low volume, one worker pool against one broker is fine. The walls show up in a predictable order, and each has a specific move.

Separate queues by workload shape. The first scaling move is routing. Put fast tasks (send email) and slow tasks (generate a report) on different queues drained by different worker pools, so one slow task type can never starve a fast one. Map task names to queues with task_routes, then start dedicated workers:

app.conf.task_routes = {
    "emails.send":      {"queue": "fast"},
    "reports.generate": {"queue": "slow"},
}
# celery -A app worker -Q fast --concurrency=16 --pool=gevent
# celery -A app worker -Q slow --concurrency=4  --pool=prefork

Right-size concurrency and pool to the work (covered in performance): prefork for CPU, gevent/eventlet for I/O, never one shared pool straddling both.

Autoscale on queue depth, not CPU. Worker CPU is a lagging, misleading signal — the motivating outage had idle CPU the entire time. Scale workers on broker queue depth and oldest-message age. A backlog with idle CPU means you’re blocked downstream and adding workers won’t help, but you want to see that, and depth-based scaling forces you to graph it.

Graduate the broker. Redis as a broker is fine into the low thousands of tasks/sec if persistence and visibility timeout are set correctly. Past that — or when you need real priorities, durable queues, and proper acks — move to RabbitMQ. At cloud scale where you don’t want to operate a broker at all, SQS trades features (no remote control, no priorities) for being someone else’s problem.

Shard by tenant or priority. Very large systems run multiple broker instances or RabbitMQ vhosts to isolate blast radius, so one tenant’s runaway backlog can’t drown everyone else’s. This is the same instinct as sharding a database: partition the resource so failure is contained.

When to reach for it (and when not to)

Reach for Celery when you have a Python app and need to move work off the request path: sending email, generating reports, processing uploads, scheduled jobs (celery beat), or fanning out work to many workers. It’s the mature, batteries-included default for background jobs in Python, and the surrounding ecosystem — Flower for monitoring, retries with backoff, routing, beat for scheduling — is real and battle-tested.

Don’t reach for it when you need exactly-once semantics; Celery is at-least-once and no flag changes that — you build idempotency instead. Don’t use it when you need strict ordering across tasks; redelivery and multiple queues reorder work. Don’t use it as an event-streaming backbone or a durable, replayable log — that’s Kafka, not a task queue. And don’t lean on celery beat for sub-second scheduling precision; it’s a cron, not a real-time scheduler. For “just run this after the response returns,” a lighter library like RQ or Dramatiq has far fewer knobs to set wrong.

When to consider alternatives

Durable, replayable, ordered event log → Kafka. A task queue forgets a message once it’s acked; a log keeps it.
General queue theory, delivery semantics, dead-letter design → Message Queues for the vendor-neutral model behind all of this.
Idempotency keys, dedup sets, rate-limit windows → Redis, the data structures that make at-least-once safe.
Throttling downstream fan-out from workers → Rate Limiting.
Seeing the invisible backlog at all → Observability: queue depth, oldest-message age, drain rate as first-class metrics.
Coordinating worker leadership / locks across a fleet → ZooKeeper or a Redis lock, depending on the guarantees you need.

Operational checklist

Graph and alert on broker queue depth and oldest-message age, separately from worker CPU. This is the one that saves you.
Set task_acks_late=True for any task you can’t lose, and make every such task idempotent with a dedup key.
For long or uneven tasks, set worker_prefetch_multiplier=1 so slow work spreads across the pool.
Set task_time_limit and task_soft_time_limit so a hung task can’t pin a worker forever; put timeouts on every network call inside tasks.
On the Redis broker, set visibility_timeout longer than your slowest task to avoid concurrent duplicate execution.
Set result_expires (or task_ignore_result=True) so the result backend doesn’t leak memory.
Configure max_retries with exponential backoff (retry_backoff=True, retry_jitter=True) and a dead-letter path for poison messages.
Route fast and slow tasks to separate queues with separate worker pools; match pool type (prefork vs gevent) to CPU- vs I/O-bound work.
Pass small serializable args (IDs, not ORM objects) to keep broker messages small and replayable.

Summary

Celery is the default way to run background work in a Python app, and almost every sharp edge traces back to two facts: it delivers at least once (so your tasks must be idempotent), and your backlog is invisible (so you must graph broker queue depth, not worker CPU). Choose the broker for its delivery semantics, flip acks_late on for work you can’t lose, set prefetch=1 for long tasks, bound every task with a time limit, give the result backend a TTL, and route fast and slow work to separate pools. Do that and Celery is boring background infrastructure. Skip it and you’ll meet each failure mode in production, one page at a time, usually starting with a queue depth nobody was watching.

Appendix: task-queue fundamentals

If the body assumed concepts you’d like restated:

Producer / consumer — the producer enqueues work and moves on; the consumer (worker) dequeues and runs it. They never talk directly; the broker sits between them, which is what lets them scale and fail independently.
At-least-once vs at-most-once vs exactly-once — at-most-once may drop messages but never repeats (ack before work); at-least-once never drops but may repeat (ack after work); exactly-once is a property you build with idempotency on top of at-least-once, not a delivery mode a broker hands you.
Ack (acknowledgment) — the signal from consumer to broker that a message is handled and can be deleted. When you ack (before or after running the task) is the whole durability story.
Visibility timeout — on brokers without native acks (Redis, SQS), the window during which a taken message is hidden from other consumers. Exceed it with a still-running task and the message reappears, causing duplicate work.
Backpressure — the mechanism by which a slow consumer slows the producer. Celery has none by default: producers enqueue freely while the broker absorbs the deficit, which is exactly why the backlog grows invisibly.

Incidents & deep-dives

Where this system breaks in production — and how it comes back.

Documenting next

🔒 Broker Backpressure & The Invisible Queueroadmap →