Object Storage & Large Blobs
Object storage is not a filesystem. The S3 model, presigned uploads, multipart, lifecycle tiering, CDN fronting, egress economics, and the one rule that prevents most incidents.
Every engineer stores user uploads the wrong way exactly once. You reach for what’s in front of you — a BLOB column in Postgres, or the local disk of whatever box the request landed on — and it works in the demo. Then the disk fills at 2am, or the database backup balloons to 400 GB because someone uploaded their wedding video, or the second app server can’t see the file the first one wrote, and you discover that “where do the bytes live” is an architectural decision you skipped.
Object storage is the answer, and it is almost universally misunderstood as “a hard drive in the cloud.” It is not a hard drive, it is not a filesystem, and treating it like one is the root cause of most object-storage pain. It is a flat, HTTP-addressable key-value store for immutable blobs — no directories, no in-place edits, latency measured in tens of milliseconds, and durability so high you will retire before you lose an object to hardware.
This article is the mental model for storing large blobs — images, video, backups, data exports, ML artifacts — and the single architectural rule that prevents the majority of incidents: never stream large file bytes through your application tier. Get that one thing right and most of the rest is tuning.
It leans on Caching Strategies for the CDN layer that sits in front of object storage, Load Balancing for the request path, Observability for watching cost and egress as first-class signals, and PostgreSQL for the metadata index that should always accompany a bucket. When the data stops being blob-shaped, the honest alternatives are DynamoDB, Redis, and a real relational store.
A motivating failure
A video startup I consulted for let users upload clips through their API. The flow looked clean: the mobile app POSTed the file to /api/upload, the Node service buffered it, ran a virus scan, then forwarded it to S3. It shipped, it worked, the team moved on.
Six weeks later they ran a promotion. Upload volume went up maybe 8x, which is not enormous, and the API tier fell over completely. Not slowly. The pods started getting OOMKilled in a tight loop. The Kubernetes scheduler dutifully rescheduled them, they pulled traffic, buffered three 600 MB uploads each into a 1 GB memory limit, and died again. A classic crash loop, except the cause was invisible in the dashboards because CPU was fine and the database was fine.
The problem was the architecture, not the load. Every uploaded byte flowed through the application: client → load balancer → app pod (held in memory) → S3. A handful of concurrent large uploads is enough to exhaust any reasonable memory limit when you buffer whole files. The autoscaler couldn’t help — adding pods just gave the herd more places to OOM. And the bandwidth bill for that month was double what it should have been, because they paid ingress from the client and egress to S3 for every single byte.
The fix took an afternoon and deleted code. The app stopped touching bytes entirely. It started minting presigned URLs so the client uploaded straight to S3, and moved the virus scan to an event-triggered function that ran after the object landed. Memory usage on the API tier dropped to a flat line. The lesson burned in: the moment your service is in the data path for large blobs, file size becomes your scaling problem instead of the object store’s — and the object store is built for it while your API tier is not.
The one-sentence mental model
Object storage is a flat namespace of immutable objects, each addressed by a string key and reached over HTTP, engineered for near-infinite durability and parallel throughput but not for low-latency random writes or file-like editing.
Every clause is a constraint you will meet in production:
- Flat namespace: there are no directories. photos/2024/cat.jpg is one key that happens to contain slash characters. “Listing a folder” is really a prefix scan over a sorted key index, and it bills and paginates like one.
- Immutable objects: you never edit an object in place; you overwrite the whole key with a new version. There is no “append one byte at offset 50.” An edit is a full read-modify-write of the entire object.
- Addressed by a string key, reached over HTTP: every object is effectively a URL. GET, PUT, and DELETE are HTTP verbs, which is precisely why browsers, mobile clients, and CDNs can talk to the store directly without your code in the middle.
- Near-infinite durability, not low latency: S3 advertises eleven nines (99.999999999%) of durability by replicating each object across multiple devices and facilities, but first-byte latency is tens of milliseconds. It is built to never lose your data and to serve enormous aggregate throughput in parallel, never to be a fast scratch disk.
flowchart LR
CL[Client] -->|GET https| K1
CDN[CDN edge] -->|origin miss| K1
subgraph Bucket [Flat key namespace]
K1[photos/2024/a.jpg] --> V1[(blob plus\nmetadata)]
K2[backups/db.tar] --> V2[(blob plus\nmetadata)]
end
The trap hiding in that sentence is the word “key.” It feels like a path, so people build mental models around folders, mv operations, and directory locks. None of those exist. Renaming a key is a copy-then-delete of the whole object. There is no atomic rename, no transactional move, no cheap “list this directory.” Once you internalize flat, immutable, HTTP, durable-not-fast, the API stops surprising you.
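To make "flat" concrete, here is a minimal sketch that models a bucket as nothing more than a sorted list of key strings. The helper names `list_prefix` and `rename` are hypothetical and this is not how any provider implements its index; it only illustrates why a folder listing is a prefix scan and why a rename is a copy followed by a delete.

```python
import bisect

# A toy stand-in for a bucket: one sorted list of keys, no directories.
keys = sorted([
    "backups/db.tar",
    "photos/2023/dog.jpg",
    "photos/2024/a.jpg",
    "photos/2024/cat.jpg",
    "photos/2025/b.jpg",
])

def list_prefix(sorted_keys, prefix):
    """Return every key starting with prefix by scanning the sorted index."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order means no later key can match
        matches.append(key)
    return matches

def rename(store, old_key, new_key):
    """A 'rename' is a full copy followed by a delete; it is not atomic."""
    store[new_key] = store[old_key]  # copy the whole object
    del store[old_key]               # then delete the original

folder = list_prefix(keys, "photos/2024/")
store = {"photos/2024/cat.jpg": b"...bytes..."}
rename(store, "photos/2024/cat.jpg", "photos/2024/kitten.jpg")
```

Between the copy and the delete both keys exist, which is exactly the window a real copy-then-delete exposes to concurrent readers.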
Object vs block vs file
Three storage models, endlessly confused in design reviews:
| | Block (EBS, SAN) | File (NFS, EFS) | Object (S3, GCS, R2) |
|---|---|---|---|
| Unit | Fixed-size blocks | Files in a tree | Whole objects by key |
| Access | Attach to one host, mount | Mount, POSIX semantics | HTTP API, from anywhere |
| Edit in place | Yes | Yes | No — overwrite whole object |
| Scale ceiling | Volume size | Filer capacity | Effectively unlimited |
| Latency | Microseconds | Low milliseconds | Tens of milliseconds |
| Best for | DB and OS disks | Shared app filesystems | Blobs, media, backups, assets |
Reach for block when a database or operating system needs a fast local disk. Reach for file when many hosts need a shared POSIX filesystem with locking and append. Reach for object for anything large, write-once-read-many, and accessed over the network — which describes nearly all user-generated content. The mistake is using the wrong one because it was the one you already knew.
How it actually works
Keys, prefixes, and the partition hiding behind them
S3 keys are flat strings, but the service partitions request throughput internally by key prefix. This used to be a sharp edge: sequential keys like 2024-06-28T10:00:01, ...:02, ...:03 all sorted into the same partition, and a high-volume pipeline would get throttled with 503 SlowDown because one partition was hot while the rest of the keyspace sat idle. The old advice was to prepend a random hash to keys to spread them.
S3 now scales partitions automatically, supporting roughly 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix, and it splits hot prefixes over time. But the underlying physics didn’t change — throughput still scales per prefix. If you need 35,000 writes/sec, you still want roughly ten distinct high-cardinality prefixes so the service can spread them across ten partitions. Monotonic timestamp prefixes concentrate load on whatever partition is currently “newest” and make the auto-scaler chase you.
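The arithmetic and the key-spreading idea can be sketched as follows. The function names and the two-digit shard prefix format are assumptions for illustration; the only figure taken from the text is the roughly 3,500 writes per second per prefix.

```python
import hashlib
import math

PUTS_PER_PREFIX_PER_SEC = 3500  # per-prefix write figure quoted above

def prefixes_needed(target_writes_per_sec):
    """How many distinct prefixes a target write rate needs, rounded up."""
    return math.ceil(target_writes_per_sec / PUTS_PER_PREFIX_PER_SEC)

def sharded_key(natural_key, shard_count):
    """Prepend a stable, hash-derived shard so keys spread across prefixes."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    shard = int(digest[:8], 16) % shard_count
    return f"{shard:02d}/{natural_key}"

shards = prefixes_needed(35_000)  # 10
key = sharded_key("2024-06-28T10:00:01.json", shards)
```

Because the shard is derived from the key itself, any reader can recompute where an object lives without a lookup. The cost is that keys no longer sort by time, so time-ordered access has to come from somewhere other than the key order.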
Consistency used to be the other famous caveat. For years S3 was eventually consistent for overwrites and deletes — a GET immediately after a PUT could return the old bytes or a 404 for a brand-new key — and teams built retry-and-verify loops to cope. As of December 2020, S3 provides strong read-after-write consistency for all operations on all objects. A GET after a PUT returns the latest data, period. If you are still carrying eventual-consistency workarounds, delete them; this is one of the rare cases where the old mental model is now actively wrong and the workaround is pure cost. (Other providers like GCS and Cloudflare R2 are also strongly consistent today.)
Presigned URLs — the load-bearing pattern
This is the most important pattern in object storage, the one the motivating failure was missing: the client talks to the object store directly, and your application only mints permission.
sequenceDiagram
participant C as Client
participant App as Your app
participant S3 as Object store
C->>App: want to upload profile.jpg
App->>App: authz check\nbuild key
App-->>C: presigned PUT URL\nexpires 300s
C->>S3: PUT bytes directly
S3-->>C: 200 OK
C->>App: done, key recorded
App->>App: write key to DB
A presigned URL is a normal object URL with a time-limited cryptographic signature appended, derived from your credentials, that authorizes exactly one operation (PUT this specific key, or GET that specific key) for a bounded window (say 300 seconds). It is not single-use: anyone holding the URL can repeat that operation until it expires, so keep the window short. Your credentials never leave the server, and your application never touches a single byte of the file.
The shape that matters:
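# Mint an upload URL: the app does authz, then steps out of the data path
import boto3

s3 = boto3.client("s3")
user_id = 42  # placeholder; in practice, the authenticated user's ID

url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "uploads", "Key": f"users/{user_id}/profile.jpg",
            "ContentType": "image/jpeg"},
    ExpiresIn=300,  # seconds
)
# client then: PUT <url> with the raw bytes and a matching Content-Type header, no app hop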
Uploads go client → S3. Downloads go S3 (or a CDN) → client. Your app handles only the metadata: who is allowed, what the key is, and recording the resulting key in a database. This is what keeps the application tier small, stateless, and cheap regardless of file size — a 5 GB upload costs your API the same near-zero resources as a 5 KB one, because it never sees the bytes.
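To show what "minting permission" means mechanically, here is a toy signer and verifier built on an HMAC. It is deliberately not S3's real signing scheme (Signature Version 4); the secret, the `store.example` host, and the `presign` and `verify` helpers are all hypothetical. It only illustrates that the signature binds one method, one key, and one expiry time.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"server-side-secret"  # hypothetical; never leaves the server

def _signature(method, path, expires_at):
    message = f"{method}\n{path}\n{expires_at}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

def presign(method, path, ttl_seconds, now=None):
    """Mint a URL that authorizes one method on one key until it expires."""
    issued_at = int(now if now is not None else time.time())
    expires_at = issued_at + ttl_seconds
    query = urlencode({"expires": expires_at,
                       "sig": _signature(method, path, expires_at)})
    return f"https://store.example{path}?{query}"

def verify(method, url, now=None):
    """What the store does: recompute the signature and check the clock."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    expires_at = int(params["expires"][0])
    expected = _signature(method, parsed.path, expires_at)
    current = now if now is not None else time.time()
    return (hmac.compare_digest(expected, params["sig"][0])
            and current <= expires_at)

url = presign("PUT", "/uploads/users/42/profile.jpg", 300, now=1_000)
```

In this sketch, changing the method, the key, or the expiry invalidates the signature, while the same URL verifies repeatedly until the clock passes the expiry.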
Multipart uploads
A single PUT is fragile for large objects and S3 caps it at 5 GB anyway. Drop the connection at 4.9 GB and you start over from zero. Multipart upload solves this: split the object into parts (5 MB to 5 GB each, up to 10,000 parts), upload them in parallel, then send a CompleteMultipartUpload with the list of part ETags to stitch them server-side.
flowchart TD
S[Initiate\nmultipart] --> P1[Part 1\nupload]
S --> P2[Part 2\nupload]
S --> P3[Part 3\nupload]
P1 --> CM{Complete\nmanifest}
P2 --> CM
P3 --> CM
CM -->|all parts ok| OBJ[(Single object\nup to 5TB)]
P2 -. fails .-> RT[Retry\nonly part 2]
RT --> CM
Three wins: parallelism (saturate bandwidth with concurrent parts), resumability (a failed part retries alone, not the whole file), and size (objects up to 5 TB). It also composes with presigned URLs — you presign each part PUT so the client still uploads directly. The catch that bites later: incomplete multipart uploads keep their already-uploaded parts and you keep paying storage for them, invisible to a normal object listing, until you abort them. That is what the lifecycle rule below is for.
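The part-splitting step can be sketched without any SDK. The planner below is an illustration under stated assumptions: the limits are the ones quoted above read as binary units, the 64 MiB default part size is arbitrary, and `plan_parts` is a hypothetical helper rather than a library call.

```python
import math

MIB = 1024 * 1024
MIN_PART = 5 * MIB          # minimum part size (the last part may be smaller)
MAX_PART = 5 * 1024 * MIB   # maximum part size
MAX_PARTS = 10_000

def plan_parts(object_size, preferred_part_size=64 * MIB):
    """Return (part_number, first_byte, last_byte) tuples covering the object."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size if the preferred one would need more than 10,000 parts.
    part_size = max(preferred_part_size, MIN_PART,
                    math.ceil(object_size / MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object too large for multipart limits")
    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        end = min(offset + part_size, object_size) - 1
        parts.append((number, offset, end))
        offset = end + 1
        number += 1
    return parts

parts = plan_parts(200 * MIB)  # 64 + 64 + 64 + 8 MiB
failed = {2}                   # pretend part 2 failed mid-upload
to_retry = [part for part in parts if part[0] in failed]
```

Each tuple is an independent unit of work, which is what makes both the parallelism and the retry-one-part behavior possible.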
Lifecycle and storage tiering
Object stores offer storage classes that trade retrieval latency against cost. A lifecycle policy transitions objects between classes by age automatically: hot data in Standard, objects older than 30 days to Infrequent Access (cheaper storage, a per-GB retrieval fee), objects older than 90 days to a Glacier-class archive (retrieval in minutes to hours), and eventually expire/delete. Lifecycle rules also clean up old versions and, critically, abort incomplete multipart uploads. Tiering is usually the single biggest lever on a storage bill, because most blobs are written once, read heavily for a week, and then never touched again while you pay Standard rates for them forever.
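As a sketch, the policy described above could be written as the following configuration, in the shape the S3 lifecycle API accepts. The rule ID, the bucket name, the 365-day expiry, and the 30-day noncurrent-version window are assumptions chosen for illustration; only the 30-day and 90-day transitions and the multipart abort come from the text.

```python
import json

lifecycle = {
    "Rules": [
        {
            "ID": "tier-and-clean",          # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},        # apply to the whole bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},     # assumed retention period
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},  # assumed
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
# With boto3 installed and credentials configured, this could be applied as:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="uploads", LifecycleConfiguration=lifecycle)
```

Keeping the policy as data in version control makes it reviewable, which matters because an expiry rule is a scheduled delete.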
The tradeoffs that bite
These look free at design time and send you an invoice later.
| Tradeoff | The free-looking choice | What it actually costs |
|---|---|---|
| Durability vs latency | Treating S3 as a fast disk | Tens-of-ms first byte; wrong for hot random reads |
| Cheap storage vs egress | Serving blobs straight from the bucket | Egress + per-request fees dominate the bill |
| Immutability vs edit cost | Frequent small edits to big objects | Every edit is a full rewrite of the whole object |
| Flat namespace vs listing | Using LIST as a directory read | Paginated scans, slow and costly at scale |
| Convenience vs security | Public-read bucket “for simplicity” | One misconfig = a world-readable data leak |
| In-app proxy vs direct | Streaming bytes through your service | File size becomes your scaling and bandwidth problem |
Two deserve emphasis. Cheap storage vs egress is the one that surprises finance: storing a terabyte is a rounding error, but moving a terabyte out to the internet, plus per-request charges on millions of small GETs, is where the real money goes. Serving a viral video directly from S3 to millions of viewers is an egress invoice with your name on it; a CDN in front turns most of those reads into cheap cache hits that never touch the bucket.
Immutability vs edit cost catches people who model a blob as a mutable file. If your access pattern is “append a line to this log object every second” or “patch 40 bytes in the middle of a large file,” every operation rewrites the entire object. That is not a tuning problem, it is the wrong storage model — you want a real database, a log system like Kafka, or a file/block volume instead.
Cost and performance: the part nobody models
Object storage performance is mostly about parallelism and what’s in front of it, not raw per-object speed. The levers, in rough order of impact:
- A CDN in front of reads. This is the biggest single lever for read-heavy assets. A cache hit at the edge is faster for the user and free of origin egress. A bucket fronted by a CDN with a 95% hit ratio pays origin egress on only 5% of reads.
- Parallelism over single-stream speed. One GET streams at maybe tens of MB/s; a hundred concurrent GETs across different objects saturate a 10 Gbps link. Object storage is a parallel beast: design clients to fan out, not to stream one giant file serially.
- Multipart for large objects: parallel parts turn a 5 GB upload from a 10-minute fragile stream into a fast, resumable burst.
- Range GETs: fetch byte ranges (Range: bytes=0-1048575) to read the head of a large object, stream video by chunk, or parallelize a download without pulling the whole thing.
- Prefix spread: for extreme write rates, spread keys across high-cardinality prefixes so the service partitions throughput across them.
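The Range GET lever from the list above comes down to computing inclusive byte ranges. A minimal sketch, assuming a hypothetical `range_headers` helper and a 1 MiB chunk size; each header would go on its own GET, and the requests can run concurrently.

```python
def range_headers(object_size, chunk_size):
    """Split an object into inclusive byte ranges, one Range header each."""
    if object_size <= 0 or chunk_size <= 0:
        raise ValueError("sizes must be positive")
    headers = []
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1
        headers.append({"Range": f"bytes={start}-{end}"})
    return headers

MIB = 1024 * 1024
headers = range_headers(object_size=3 * MIB + 10, chunk_size=MIB)
```

Note that HTTP ranges are inclusive at both ends, so the first mebibyte is `bytes=0-1048575`, matching the example in the text.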
The numbers that actually shape decisions are economic, not latency. Roughly: storage is cents per GB-month, a GET costs a tiny fraction of a cent but you do billions of them, and internet egress is the dominant line item — often more than storage and requests combined for a media workload. Three rules of thumb that have saved real money:
- A CDN pays for itself the moment egress is non-trivial. Edge cache hits are cheaper than origin egress, and many CDN-to-internet rates beat raw cloud egress anyway.
- Millions of tiny objects cost more than their bytes because of per-request fees and per-object minimum billable sizes (e.g. IA-class minimum object size). Batch small files into archives where you can.
- Lifecycle tiering is free money on cold data — most of which you are storing at hot prices out of neglect.
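A back-of-envelope model makes the shape of the bill visible. Every price below is an illustrative placeholder, not a quote from any provider's price list, and the model assumes the CDN bills for all delivered bytes while the bucket bills egress and requests only on cache misses. Real pricing has tiers and provider-specific waivers that this sketch ignores.

```python
def monthly_cost(stored_gb, read_gb, read_requests, prices, cdn_hit_ratio=None):
    """Rough monthly bill; cdn_hit_ratio=None means reads hit the bucket directly."""
    miss = 1.0 if cdn_hit_ratio is None else 1.0 - cdn_hit_ratio
    cdn_delivery = 0.0 if cdn_hit_ratio is None else read_gb * prices["cdn_per_gb"]
    bill = {
        "storage": stored_gb * prices["storage_per_gb"],
        "origin_egress": read_gb * miss * prices["egress_per_gb"],
        "origin_requests": read_requests * miss / 1000 * prices["per_1k_gets"],
        "cdn_delivery": cdn_delivery,
    }
    bill["total"] = sum(bill.values())
    return bill

# Placeholder prices in dollars; substitute your provider's real numbers.
PRICES = {"storage_per_gb": 0.02, "egress_per_gb": 0.09,
          "cdn_per_gb": 0.05, "per_1k_gets": 0.0004}

direct = monthly_cost(1_000, 50_000, 100_000_000, PRICES)
fronted = monthly_cost(1_000, 50_000, 100_000_000, PRICES, cdn_hit_ratio=0.95)
```

With these placeholder inputs, storage is a small fraction of either bill, and the CDN case cuts origin egress to the 5% of reads that miss the cache.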
Watch the right signals in observability: egress bytes, request counts split by verb, storage by class, CDN cache-hit ratio, and 4xx/5xx rates (a creeping 503 SlowDown means you are hitting prefix throughput limits). Per-object latency is the metric people instinctively watch and the one that matters least.
Failure modes
How object storage actually hurts you in production. Symptom → root cause → prevention.
Proxying large blobs through your app. Symptom: the application tier OOMs or times out under upload load, autoscaling makes it worse, bandwidth bill doubles. Root cause: every byte flows through your service, so file size and concurrency exhaust memory and double your bandwidth (in from client, out to S3) — the opening story. Prevention: presigned URLs for both upload and download; the app handles metadata only, never bytes.
Public bucket exposure. Symptom: a security researcher (or attacker) emails you about your customers’ private data being world-readable. Root cause: a permissive bucket policy or object ACL — often added “temporarily for testing.” This is one of the most common cloud data leaks in existence. Prevention: enable “block all public access” at the account and bucket level, serve reads only via presigned URLs or a CDN with origin access control, and audit policies continuously.
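A sketch of both halves of that prevention: the four settings that make up S3's public access block, and a deliberately simplistic audit that flags policy statements granting access to everyone. The `public_statements` helper and the sample policy are hypothetical, and the check ignores `Condition` blocks, so treat it as a starting point rather than a complete audit.

```python
public_access_block = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}
# With boto3 installed and credentials configured, this could be applied as:
# boto3.client("s3").put_public_access_block(
#     Bucket="uploads", PublicAccessBlockConfiguration=public_access_block)

def public_statements(policy):
    """Return Allow statements whose principal is everyone ('*')."""
    flagged = []
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal")
        is_public = principal == "*" or (
            isinstance(principal, dict) and principal.get("AWS") == "*")
        if stmt.get("Effect") == "Allow" and is_public:
            flagged.append(stmt)
    return flagged

# Hypothetical policy of the "temporarily for testing" kind.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "TempTesting", "Effect": "Allow", "Principal": "*",
         "Action": "s3:GetObject", "Resource": "arn:aws:s3:::uploads/*"},
        {"Sid": "AppRole", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::123456789012:role/app"},
         "Action": "s3:PutObject", "Resource": "arn:aws:s3:::uploads/*"},
    ],
}
leaks = public_statements(policy)
```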
Egress and request-cost surprise. Symptom: a five-figure cloud bill after a traffic spike or a launch. Root cause: hot objects served directly from the bucket with no CDN, or millions of tiny objects racking up per-request charges. Prevention: CDN in front, monitor egress and request counts as first-class cost metrics, batch small objects.
Incomplete multipart uploads accumulating. Symptom: storage cost creeps up and doesn’t match the size of objects you can see in a listing. Root cause: failed or abandoned multipart uploads silently retain their parts and bill indefinitely, and they don’t appear in a normal ListObjects. Prevention: a lifecycle rule that aborts incomplete multipart uploads after N days (7 is typical).
Hot-prefix throttling. Symptom: 503 SlowDown under a heavy, bursty write pipeline. Root cause: a sudden burst all keyed under one monotonic prefix concentrates on a single partition faster than auto-scaling reacts. Prevention: high-cardinality prefixes and exponential-backoff retries on 503.
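The retry half of that prevention can be sketched as exponential backoff with full jitter. The `SlowDown` exception is a stand-in for however your client surfaces a 503, and the delays and attempt count are assumptions to tune; many SDKs already ship their own retry logic, so check before layering this on top.

```python
import random
import time

class SlowDown(Exception):
    """Stand-in for a 503 SlowDown response from the store."""

def with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                 sleep=time.sleep):
    """Call operation, retrying on SlowDown with exponentially growing jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except SlowDown:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the throttle to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # full jitter avoids retry herds

attempts = {"count": 0}

def flaky_put():
    """Fails twice with a throttle, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise SlowDown("503 SlowDown")
    return "stored"

delays = []
result = with_backoff(flaky_put, sleep=delays.append)  # record instead of sleeping
```

The jitter matters as much as the exponent: without it, every throttled writer retries at the same instant and re-creates the burst.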
Treating “deleted” as gone with versioning on. Symptom: storage cost doesn’t drop after a big cleanup; “deleted” objects reappear. Root cause: versioning is enabled, so a DELETE just writes a delete marker and the old versions still cost money. Prevention: lifecycle rules to expire noncurrent versions, and know whether versioning is on before you reason about deletes.
Never let a large blob’s bytes flow through your application tier. The instant your service sits in the data path for uploads or downloads, file size and concurrency become your scaling problem instead of the object store’s — and the store was built for it while your stateless API tier was not. Mint a presigned URL, let the client talk to the bucket directly, and keep your app handling only metadata. This single rule prevents the majority of object-storage incidents I have ever been paged for.
Scaling it
The thing about object storage is that capacity is never what you scale: it’s effectively infinite from day one. What you scale is the stuff around it: read distribution, write throughput, and the index.
- CDN in front, always, for read-heavy assets. This is step one, not an optimization for later. A CDN absorbs reads at the edge, cuts global latency, and collapses egress cost by serving cache hits instead of origin fetches. Lock the bucket private and use origin access control so only the CDN can read it — never a public bucket behind a CDN.
- Spread keys across prefixes for write-heavy throughput; avoid timestamp/sequential prefixes that concentrate load on one partition. This is the same load-spreading instinct behind consistent hashing and sharding, applied to a key namespace.
- Keep an index, don’t LIST. Store object keys and metadata in a database (PostgreSQL or DynamoDB) so you query an index instead of scanning prefixes. The object store holds bytes; the database holds the map. A design that constantly lists large prefixes to “find files” gets slow and expensive. That’s a query, and queries belong in a database.
- Lifecycle everything. Automate tiering by age and auto-abort incomplete multipart uploads. This is the main cost-control lever once you’re at scale.
- The wall you hit is cost and request rate, not capacity. The fixes are a CDN, prefix spread, batching small objects, and tiering cold data — never “a bigger bucket,” because the bucket was never the limit.
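The "keep an index" item above can be sketched with an in-memory SQLite database standing in for PostgreSQL. The table layout, column names, and helper functions are assumptions for illustration; the point is only that "find this user's files" becomes an indexed query rather than a prefix scan.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the real metadata database
db.execute("""
    CREATE TABLE objects (
        key          TEXT PRIMARY KEY,
        owner_id     INTEGER NOT NULL,
        content_type TEXT NOT NULL,
        size_bytes   INTEGER NOT NULL,
        created_at   TEXT NOT NULL
    )
""")
db.execute("CREATE INDEX objects_owner ON objects (owner_id, created_at)")

def record_upload(key, owner_id, content_type, size_bytes, created_at):
    """Called after the client reports a successful direct upload."""
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO objects VALUES (?, ?, ?, ?, ?)",
                   (key, owner_id, content_type, size_bytes, created_at))

def keys_for_owner(owner_id):
    """Newest-first keys for one owner, answered by the index, not by LIST."""
    rows = db.execute(
        "SELECT key FROM objects WHERE owner_id = ? ORDER BY created_at DESC",
        (owner_id,))
    return [row[0] for row in rows]

record_upload("users/42/profile.jpg", 42, "image/jpeg", 52_113, "2024-06-01T10:00:00Z")
record_upload("users/42/clip.mp4", 42, "video/mp4", 600_000_000, "2024-06-28T10:00:01Z")
record_upload("users/7/avatar.png", 7, "image/png", 9_001, "2024-06-15T08:30:00Z")
recent = keys_for_owner(42)
```

Because the database is the map, key names are free to be opaque or hash-sharded without making objects harder to find.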
When to reach for it (and when not to)
Reach for object storage for user-generated content (images, video, documents), static website and SPA assets, backups and database snapshots, data-lake and analytics files (Parquet, logs), ML datasets and model artifacts, and any large write-once-read-many blob accessed over the network. It is the default home for anything big that’s served over HTTP, and pairing it with a CDN is the standard way to serve static media on the internet.
Don’t use it as a database. No transactions, no rich queries, no secondary indexes, no low-latency random access. Store the blob in object storage and the metadata plus the key pointer in a real database — that split is the canonical pattern and skipping it is how people end up doing LIST scans as a query engine.
Don’t use it for frequently-edited files or anything needing POSIX, append, or locking semantics — that’s file (EFS territory for shared filesystems) or block storage. Don’t use it for low-latency lookups on hot small values — that’s Redis or a database with a cache in front.
When to consider alternatives
- Fast disk for a database or OS → block storage (EBS and friends).
- Shared POSIX filesystem across many hosts → file storage (EFS/NFS), where append and locking matter.
- Structured, queryable, transactional data → PostgreSQL or DynamoDB; keep only the blob in object storage.
- Sub-millisecond hot small reads → Redis as a cache in front of the source of truth.
- An append-only event log or streaming backbone → Kafka or Message Queues, not a rewritten log object.
- Search and relevance over document contents → Elasticsearch as a secondary index.
The unifying rule: object storage wins specifically when the data is large, blob-shaped, immutable, and read over HTTP. The moment the requirement becomes “edit in place,” “query it,” “microsecond reads,” or “guaranteed delivery,” the right tool is something purpose-built, and object storage holds the bytes those tools point at.
Operational checklist
- Use presigned URLs for client uploads and downloads; never proxy blob bytes through the app tier.
- Default buckets to block all public access; serve reads via a CDN with origin access control or via presigned URLs, and audit policies on a schedule.
- Use multipart upload for objects over ~100 MB; presign each part so the client still uploads directly.
- Set a lifecycle rule to abort incomplete multipart uploads (e.g. after 7 days); they bill silently and hide from listings.
- Put a CDN in front of read-heavy assets and alarm on cache-hit ratio dropping; egress is the dominant cost line.
- Set lifecycle tiering by object age (Standard → IA → archive → expire) and expire noncurrent versions if versioning is on.
- Design keys with high-cardinality prefixes for high-throughput pipelines; avoid monotonic timestamp prefixes, and retry 503 SlowDown with backoff.
- Keep object keys plus metadata in a database index; never use LIST as a query engine.
- Track egress bytes, request counts by verb, storage by class, and 5xx rate as first-class cost and health metrics in observability.
- Run post-upload work (virus scan, thumbnailing, transcoding) event-driven after the object lands, not inline in the upload path.
Summary
Object storage is the best home for large blobs on the planet, and almost all of its sharp edges come from one category error: treating it like a filesystem. It is a flat, immutable, HTTP-addressable key-value store with eleven nines of durability and tens-of-ms latency — built for parallel throughput and to never lose data, not for editing or fast random access. The one rule that prevents most incidents is to keep your application out of the data path: mint presigned URLs, let clients talk to the bucket directly, and handle only metadata in a database index. Put a CDN in front of reads because egress, not storage, is the bill. Use multipart for big objects and lifecycle rules to abort orphaned uploads and tier cold data. Lock buckets private by default. Do that and object storage becomes the most boring, reliable, and cheap dependency in your architecture — and store the blob here while the truth about it lives in a real database next door.
Appendix: durability vs availability (and the egress mental model)
Two fundamentals worth restating, because they get conflated.
- Durability is the probability your object still exists and is intact over time. S3’s eleven nines means that for ten million objects, you’d statistically expect to lose one every ten thousand years. It is achieved by replicating each object across multiple devices and (in standard classes) multiple facilities, with continuous integrity checks. Durability answers “will my data survive?”
- Availability is the probability you can reach the data right now. It is lower than durability (Standard targets around 99.99%) because a region or network blip can make data temporarily unreachable without it being lost. Availability answers “can I read it this second?” Don’t quote eleven nines when you mean uptime; they are different guarantees.
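The two figures above can be checked with a few lines of exact arithmetic. This sketch assumes durability is read as a per-object annual survival probability and availability as a fraction of a 365-day year; the function names are hypothetical.

```python
from fractions import Fraction

def years_per_lost_object(object_count, durability):
    """Expected years between single-object losses at a given durability."""
    annual_loss_rate = 1 - Fraction(durability)
    return 1 / (object_count * annual_loss_rate)

def downtime_minutes_per_year(availability, days=365):
    """Minutes per year the data may be unreachable at a given availability."""
    return float((1 - Fraction(availability)) * days * 24 * 60)

years = years_per_lost_object(10_000_000, "0.99999999999")  # 10000
downtime = downtime_minutes_per_year("0.9999")              # 52.56
```

Eleven nines over ten million objects works out to one expected loss per ten thousand years, while four nines of availability still allows a little under an hour of unreachability per year, which is why the two numbers should never be swapped.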
The egress mental model: think of object storage as a warehouse where storing pallets is nearly free, but a truck leaving the loading dock costs money every time, and the dock charges a small fee per truck regardless of how full it is. The way you cut the bill is to stop sending trucks out the same gate over and over — you put a local depot (the CDN) near your customers so most pickups never come back to the warehouse, and you stop sending out lots of near-empty trucks (batch small objects). Cheap storage, expensive movement: design for movement, not capacity.