Object Storage System

Intermediate · Storage · Distribution · 50 min read

1 Understand the Problem & Establish Design Scope → 2 High-Level Design → 3 Deep Dive → 4 Wrap Up

§1 Step 2: High-Level Design

Design S3-compatible storage. Multipart uploads, erasure coding, and cross-region replication.

Progressive build — add each component step by step

Add Worker Services for Async Processing

Add workers that handle slow upload processing tasks: virus scanning, image resizing, metadata extraction.

What it does

Upload processing workers are server-side components that perform CPU-intensive operations on stored files asynchronously after the upload completes. These workers handle virus scanning to detect malware, format conversion to create multiple media versions, image and video transcoding to optimize for different devices and bandwidths, EXIF metadata extraction from images, and thumbnail generation for preview interfaces. Each worker is a stateless process that consumes upload completion events from a message queue, processes the file, and writes results back to the metadata database.
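
A minimal sketch of such a worker, assuming an in-process `queue.Queue` stands in for the message queue and a plain dict stands in for the metadata database. The event fields (`upload_id`, `data`), the status values, and the `scan_for_malware` helper are hypothetical placeholders for illustration, not part of any real service.

```python
import queue

# Hypothetical stand-ins: a real system would use a durable message queue
# and the metadata database instead of these in-memory objects.
events: "queue.Queue[dict]" = queue.Queue()
metadata_db: dict = {}


def scan_for_malware(data: bytes) -> bool:
    """Placeholder scan: flags content that contains a fake signature."""
    return b"EICAR" in data


def process_event(event: dict) -> None:
    """Handle one upload-completion event and record the result."""
    upload_id = event["upload_id"]
    metadata_db[upload_id] = {"status": "processing"}
    infected = scan_for_malware(event["data"])
    metadata_db[upload_id] = {
        "status": "quarantined" if infected else "ready",
        "size": len(event["data"]),
    }


def run_worker() -> int:
    """Drain the queue. The worker keeps no state between events."""
    handled = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return handled
        process_event(event)
        events.task_done()
        handled += 1


events.put({"upload_id": "u1", "data": b"hello"})
events.put({"upload_id": "u2", "data": b"xxEICARxx"})
run_worker()
```

Because the worker holds no state of its own, any number of identical workers can consume from the same queue.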

Why it matters

Processing uploads synchronously ties up the upload API thread for as long as processing runs, so it cannot handle new uploads. A virus scan might take 5-30 seconds, and video transcoding can take minutes. The result is a bottleneck: clients wait for processing before getting a response, and the API cannot accept new uploads in the meantime. Workers decouple slow operations from the fast upload response path. The client gets an immediate success response with an upload ID (200 OK, or 202 Accepted to signal that work is still pending), then polls or receives webhooks for processing status.

Trade-off

Async processing means the uploaded file isn't immediately available in its final form. Users see an 'uploading' state (while parts are assembling), then a 'processing' state (while workers transcode/scan), then 'ready' when all async work completes. This introduces 5-60 second latency before a file is fully usable. For most applications this is acceptable — users expect processing time. However, for time-critical workflows (e.g., security incident response needing immediate threat analysis), synchronous processing may be required.

Real world

AWS S3 integrates with Lambda for post-upload processing — trigger functions run automatically on object creation. Dropbox uses a worker fleet for preview generation; all files get thumbnails within minutes of upload. Box uses workers for OCR and content extraction, enabling full-text search on document uploads. Google Drive uses workers to convert Office files to Google Sheets/Docs format.

Capacity math

If uploads average 5MB and each worker processes 50MB/second, one worker handles 10 uploads/second. For a peak of 1K uploads/second, you need 100 worker instances. Workers should auto-scale on queue depth: for example, when the queue reaches 10K pending items, launch 50 more workers to clear the backlog.
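
The arithmetic above can be written out as a short sketch. The function names are illustrative, and the scaling rule is the simple threshold described in the text rather than a tuned policy.

```python
import math


def uploads_per_worker(avg_upload_mb: float, worker_mb_per_s: float) -> float:
    """Uploads one worker can finish per second."""
    return worker_mb_per_s / avg_upload_mb


def workers_needed(peak_uploads_per_s: float, avg_upload_mb: float,
                   worker_mb_per_s: float) -> int:
    """Workers required to keep up with the peak upload rate."""
    per_worker = uploads_per_worker(avg_upload_mb, worker_mb_per_s)
    return math.ceil(peak_uploads_per_s / per_worker)


def extra_workers(queue_depth: int, threshold: int = 10_000, step: int = 50) -> int:
    """Workers to launch: `step` once the backlog reaches `threshold`."""
    return step if queue_depth >= threshold else 0


print(workers_needed(1_000, 5, 50))  # 1000 / (50 / 5) = 100
```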

Add Postgres for File Metadata

Store file metadata (name, size, type, owner, permissions, versions) in Postgres for fast querying and access control.

What it does

Postgres stores the searchable file metadata catalog containing file names, sizes, content types (MIME types), owner IDs, access control lists (who can read/write/delete), version history with timestamps, and the S3 content key mapping to actual file bytes stored in erasure-coded chunks. The schema includes indexes on owner_id for efficient per-user queries, file names for search, and created_at for sorting. The s3_key field stores the SHA-256 hash or storage location pointer that directs the API to the correct erasure-coded chunks.

Why it matters

Object storage services like S3 cannot be queried by metadata: you can fetch an object by its key or list keys by prefix, but you cannot filter by attributes. To answer user queries like 'show me all files owned by user X larger than 10MB created in the last week', you need a relational database. S3 stores only the raw bytes; Postgres stores the structured, searchable metadata. The API splits responsibility: S3 is the content repository (durability, large-scale storage), Postgres is the metadata index (queryability, access control, versioning).
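
As a rough illustration of that example query, the sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs without a server. The `files` table, the index names, and every column other than `owner_id`, `s3_key`, and `created_at` are assumptions made for this example.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
conn.executescript("""
CREATE TABLE files (
    id           INTEGER PRIMARY KEY,
    owner_id     TEXT    NOT NULL,
    name         TEXT    NOT NULL,
    size_bytes   INTEGER NOT NULL,
    content_type TEXT    NOT NULL,
    s3_key       TEXT    NOT NULL,  -- pointer to the stored bytes
    created_at   TEXT    NOT NULL   -- ISO-8601 UTC timestamp
);
CREATE INDEX idx_files_owner   ON files (owner_id);
CREATE INDEX idx_files_created ON files (created_at);
""")

MB = 1024 * 1024
now = datetime.now(timezone.utc)
rows = [
    ("user-x", "video.mp4", 50 * MB, "video/mp4", "ab/abc1", now - timedelta(days=2)),
    ("user-x", "note.txt", 2048, "text/plain", "cd/cde2", now - timedelta(days=1)),
    ("user-x", "old.iso", 700 * MB, "application/octet-stream", "ef/efg3", now - timedelta(days=30)),
    ("user-y", "big.bin", 90 * MB, "application/octet-stream", "12/1234", now - timedelta(days=1)),
]
conn.executemany(
    "INSERT INTO files (owner_id, name, size_bytes, content_type, s3_key, created_at)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [(o, n, s, c, k, t.isoformat(timespec="microseconds")) for o, n, s, c, k, t in rows],
)


def recent_large_files(owner_id: str, min_bytes: int, days: int) -> list:
    """Names of an owner's files larger than min_bytes created in the last `days` days."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat(timespec="microseconds")
    cur = conn.execute(
        "SELECT name FROM files"
        " WHERE owner_id = ? AND size_bytes > ? AND created_at >= ?"
        " ORDER BY created_at DESC",
        (owner_id, min_bytes, cutoff),
    )
    return [name for (name,) in cur]


print(recent_large_files("user-x", 10 * MB, 7))
```

In Postgres the timestamp column would more naturally be a native timestamp type; text timestamps are used here only because SQLite has no such type.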

Trade-off

Metadata and content can drift out of sync because a write spans two systems and is not atomic. If you write metadata to Postgres first and the S3 upload then fails, you have orphaned metadata pointing to non-existent content. Conversely, if you upload to S3 first and Postgres crashes before persisting metadata, the content is stored but unreachable. Use the saga pattern or a transactional outbox: write metadata to Postgres first, then upload to S3, and on S3 failure roll back the Postgres record as a compensating action. This keeps the metadata-to-content mapping consistent, at the cost of additional logic.
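
A minimal sketch of that write order with a compensating rollback, assuming two in-memory dicts stand in for Postgres and S3. The `pending`/`committed` status field and the `fail_upload` switch are assumptions added for illustration; the source only specifies "metadata first, roll back on S3 failure".

```python
class UploadError(Exception):
    """Raised when the (simulated) object store rejects a write."""


metadata: dict = {}    # stand-in for the Postgres table
blob_store: dict = {}  # stand-in for S3


def put_blob(key: str, data: bytes, fail: bool = False) -> None:
    """Store bytes under key, or raise to simulate a storage failure."""
    if fail:
        raise UploadError(f"simulated storage failure for {key}")
    blob_store[key] = data


def save_file(file_id: str, key: str, data: bytes, fail_upload: bool = False) -> bool:
    """Write metadata, upload content, and undo the metadata if the upload fails."""
    # Step 1: record intent. Readers should ignore rows that are not committed.
    metadata[file_id] = {"s3_key": key, "status": "pending"}
    try:
        # Step 2: upload the content.
        put_blob(key, data, fail=fail_upload)
    except UploadError:
        # Compensation: undo step 1 so no metadata points at missing content.
        del metadata[file_id]
        return False
    # Step 3: make the record visible.
    metadata[file_id]["status"] = "committed"
    return True


print(save_file("f1", "ab/abc1", b"hello"))
print(save_file("f2", "cd/cde2", b"world", fail_upload=True))
```

A process crash between steps 1 and 3 would still leave a `pending` row behind, so a real system would also need a periodic sweep for stale pending records.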

Real world

Dropbox used MySQL for file metadata, later migrating to custom stores, with millions of files per user. Box uses MySQL for the same kind of metadata catalog. GitHub keeps repository metadata, pull request history, and issue tracking in a relational database (MySQL), while Git stores the content blobs. All major cloud storage providers separate metadata from content storage.

Capacity math

At 1B files × 500 bytes per metadata record = 500GB of Postgres storage. For fast access, partition by owner_id so each user's files are queried from a single partition. Fetching an object by its S3 key is a direct key lookup on the storage side; Postgres only stores the mapping. At 10K uploads/second, a 16-core instance with 64GB RAM should be able to sustain the metadata write throughput, given proper indexing and connection pooling.

Add CDN for File Delivery

Add a CDN in front of S3 to serve files with low latency globally and reduce S3 egress costs.

What it does

A content delivery network (CDN) caches file content at geographically distributed edge locations (points of presence) around the world. Instead of all downloads originating from S3 in a single region (e.g., us-east-1), the CDN intercepts requests and serves cached copies from the nearest PoP. If the file is already cached at a Tokyo PoP, a Japanese user downloads from Tokyo (20ms latency) instead of fetching from us-east-1 (150ms latency). The CDN automatically fetches the file from origin once on first request, then serves all subsequent requests from cache until the TTL expires.
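
The fetch-once-then-serve-until-TTL behavior can be sketched as a tiny cache. The `EdgeCache` class, its injected clock, and the lambda used as the origin are hypothetical and exist only to make the example runnable; real CDNs add eviction, revalidation, and much more.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """One PoP: serve from cache while fresh, otherwise fetch from origin."""

    def __init__(self, fetch_origin: Callable[[str], bytes], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_origin = fetch_origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}  # key -> (expiry, body)
        self.origin_fetches = 0

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: origin is not contacted
        body = self._fetch_origin(key)  # miss or expired: go to origin
        self.origin_fetches += 1
        self._entries[key] = (now + self._ttl, body)
        return body


# A fake clock makes TTL expiry easy to demonstrate.
fake_now = [0.0]
pop = EdgeCache(lambda key: f"bytes of {key}".encode(), ttl_seconds=60,
                clock=lambda: fake_now[0])
pop.get("cat.jpg")   # first request: fetched from origin
pop.get("cat.jpg")   # served from cache
fake_now[0] = 61.0
pop.get("cat.jpg")   # TTL expired: fetched from origin again
print(pop.origin_fetches)
```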

Why it matters

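In this design, the S3 bucket lives in a single geographic region for administrative simplicity. A user in Tokyo downloading a 100MB file from S3 in us-east-1 pays about 150ms of network latency plus transfer time: 100MB is 800Mb, so at 800Mbps the transfer takes about 1 second, for roughly 1.15 seconds in total under ideal conditions. With CDN edge caching, the same download takes about 1.02 seconds (20ms + 1s). The gap is usually wider in practice, because throughput on a single TCP connection tends to fall as round-trip time grows. More importantly, S3 egress (outbound data transfer) costs about $0.09/GB, while CDN egress costs $0.01-0.03/GB depending on the provider, so traffic served from the edge is roughly 3× to 9× cheaper per GB. For a service delivering 100GB/day globally, that is about $9/day from S3 versus $1-3/day from the CDN.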

Trade-off

CDN caching works seamlessly for public files or files accessed with presigned URLs with long TTLs (where the URL itself encodes the access token). Private files requiring authentication on every request cannot be efficiently cached — the CDN would need to authenticate each request, defeating the purpose. For private files, use presigned URLs with 1-hour TTLs; clients must refresh the URL when it expires. Alternatively, use signed cookies set server-side.
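
To illustrate how a URL can carry its own expiring access token, here is a simplified HMAC-signed URL sketch. It is not AWS Signature Version 4 or any CDN's actual scheme; the `SECRET` key, the `expires`/`sig` parameter names, and both function names are assumptions for this example.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key shared by signer and verifier


def sign_url(path: str, ttl_seconds: int = 3600, now: Optional[float] = None) -> str:
    """Return path plus an expiry timestamp and a signature over both."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{path}?expires={expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    payload = f"{parsed.path}?expires={expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    current = time.time() if now is None else now
    return hmac.compare_digest(sig, expected) and current < expires


url = sign_url("/files/report.pdf", ttl_seconds=3600, now=1_000)
print(verify_url(url, now=2_000))  # within the hour
print(verify_url(url, now=5_000))  # after expiry
```

Because the check needs only the URL and the shared key, an edge server can validate requests without calling back to the application on every download.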

Real world

Dropbox uses Cloudflare CDN for publicly shared files and presigned URLs for private downloads. Box uses Akamai for file delivery. GitHub uses Fastly for repository file downloads and release artifacts. AWS S3 + CloudFront is the canonical pattern: S3 stores content, CloudFront CDN caches and distributes globally, Route53 directs users to the nearest PoP.

Capacity math

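With a cache hit rate of 90% for popular files, 90% of bandwidth is served from edge servers at $0.01/GB and the remaining 10% from S3 at $0.09/GB, for a blended cost of $0.018/GB, which is 5× lower than serving everything from S3. For a service delivering 1PB/month (1,000,000GB) globally, that is about $18K instead of $90K, a saving of roughly $72K/month. CDN PoPs typically cache 100-1000GB each depending on the provider; popular files stay cached as long as they keep being requested, while less popular files are evicted, typically by LRU.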
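
The egress arithmetic can be checked with a few lines. This is a simplified model under the prices quoted in the text: cache hits are billed at the CDN rate, misses at the origin rate, and request fees and tiered pricing are ignored.

```python
def monthly_egress_cost(gb_per_month: float, hit_rate: float,
                        cdn_per_gb: float = 0.01,
                        origin_per_gb: float = 0.09) -> float:
    """Hits are billed at the CDN rate, misses at the origin rate."""
    hits = gb_per_month * hit_rate
    misses = gb_per_month - hits
    return hits * cdn_per_gb + misses * origin_per_gb


PB_IN_GB = 1_000_000
origin_only = monthly_egress_cost(PB_IN_GB, hit_rate=0.0)  # everything from S3
with_cdn = monthly_egress_cost(PB_IN_GB, hit_rate=0.9)     # 90% from the edge

print(round(origin_only), round(with_cdn), round(origin_only - with_cdn))
print(round(origin_only / with_cdn, 1))  # cost reduction factor
```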

§2 Step 3: Deep Dive

Technique	Storage overhead	Durability	Read cost	Best for	Cost	Ops burden
3× replication	200%	11 nines	Read any copy	Simple, low-latency reads	High	Low
Erasure coding (4+2)	50%	11+ nines	Reconstruct from 4/6	Large objects, cost-sensitive ✓	Medium	Low
Erasure coding (6+3)	50%	12+ nines	Reconstruct from 6/9	Maximum durability (AWS S3)	Medium	Low
Reed-Solomon (8+4)	50%	Very high	Reconstruct from 8/12	Archival, cold storage	Low	Low
LRC (12+2+2)	33%	Very high	Local repair cheap	Facebook f4, warm storage	Low	Medium

Object storage durability techniques — erasure coding beats replication for cost.

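Content-addressable storage: SHA-256 chunk deduplication (Python)

```python
import hashlib
import os
from pathlib import Path

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB
STORAGE_ROOT = Path("/data/chunks")


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()


def chunk_path(chunk_hash: str) -> Path:
    """Shard chunk files into subdirectories by the first two hex characters."""
    return STORAGE_ROOT / chunk_hash[:2] / chunk_hash


def upload_object(file_path: str) -> dict:
    """Store a file as content-addressed chunks and return its manifest."""
    chunk_ids = []
    with open(file_path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            chunk_hash = sha256(chunk)
            chunk_ids.append(chunk_hash)
            dest = chunk_path(chunk_hash)
            if not dest.exists():  # same content = stored once
                dest.parent.mkdir(parents=True, exist_ok=True)
                # Write to a temp file, then rename, so a crash mid-write
                # never leaves a truncated chunk under its final name.
                tmp = dest.with_name(f"{dest.name}.{os.getpid()}.tmp")
                tmp.write_bytes(chunk)
                os.replace(tmp, dest)

    # The object ID is the hash of the ordered chunk hashes.
    object_hash = sha256("".join(chunk_ids).encode())
    return {"object_id": object_hash, "chunks": chunk_ids,
            "size": os.path.getsize(file_path)}


def download_object(manifest: dict, dest_path: str) -> None:
    """Reassemble an object by concatenating its chunks in manifest order."""
    with open(dest_path, "wb") as out:
        for chunk_hash in manifest["chunks"]:
            out.write(chunk_path(chunk_hash).read_bytes())
```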

Component	Why Add It	Tradeoff
Worker Services for Async Processing	Processing uploads synchronously blocks the upload API thread, preventing it from handling new uploads.	Async processing means the uploaded file isn't immediately available in its final form.
Postgres for File Metadata	Object storage services like S3 have no queryable metadata beyond the object key itself.	Metadata and content can drift out of sync if writes aren't atomic.
CDN for File Delivery	S3 is deployed in a single geographic region for administrative simplicity.	CDN caching works seamlessly for public files or files accessed with presigned URLs with long TTLs (where the URL itself encodes the access token).

Design decision tradeoffs

Storage Node Failure

One of the 12 storage nodes holding an object's erasure-coded chunks fails. With 9+3 coding, any 9 of the remaining 11 chunks can reconstruct the object, so the system continues serving with no data loss.

CDN Cache Stampede

A popular file expires from cache, or a regional CDN PoP goes offline and its traffic shifts to PoPs that have not cached the file. All concurrent requests for that file miss at once and hammer the origin S3 bucket, which sees a 100× request spike.

Multipart Upload Surge

A video upload event (e.g., a live stream archive) causes 10K concurrent multipart uploads. The API scales out workers, but the temporary spike in pending-part records floods Postgres with writes.

Erasure coding (9+3 Reed-Solomon): split an object into 9 data chunks + 3 parity chunks. Distribute across 12 storage nodes. Any 9 of the 12 chunks can reconstruct the full object. Storage overhead: 33% (vs 200% for 3× replication).
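
The overhead figures follow directly from the chunk counts, and the reconstruction idea can be shown with the simplest possible code. The sketch below uses a single XOR parity chunk, which tolerates only one lost chunk; it is a deliberately simplified stand-in, not Reed-Solomon, which generalizes the same idea to several parity chunks. All function names are illustrative.

```python
from functools import reduce
from typing import List, Optional


def storage_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Extra bytes stored per byte of user data."""
    return parity_chunks / data_chunks


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_with_parity(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal chunks (zero-padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def recover_missing(chunks: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the single missing chunk by XOR-ing all surviving chunks."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) != 1:
        raise ValueError("single parity can repair exactly one lost chunk")
    survivors = [c for c in chunks if c is not None]
    rebuilt = reduce(xor_bytes, survivors)
    return [rebuilt if c is None else c for c in chunks]


print(round(storage_overhead(9, 3), 2))  # 9+3 erasure coding
print(storage_overhead(1, 2))            # 3x replication: 2 extra copies per original

pieces = encode_with_parity(b"object storage!!", k=4)
damaged = list(pieces)
damaged[2] = None                        # simulate a failed storage node
print(recover_missing(damaged) == pieces)
```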

Content-addressed storage: object key = SHA-256(content). Same content stored once regardless of how many times it's uploaded. Deduplication is automatic — check if SHA-256 hash already exists before storing.

Multipart upload: split large files into 100MB parts and upload the parts in parallel. The server stores parts separately and assembles them on completion. If the upload is interrupted, resume by re-uploading only the failed parts. Essential for files >100MB over unreliable connections.
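
A minimal in-memory sketch of that flow, with part numbers starting at 1. The `MultipartUpload` class and `split_parts` helper are hypothetical names for illustration and do not mirror the actual S3 API; the demo uses a tiny part size so it runs instantly.

```python
from typing import Dict, List

PART_SIZE = 100 * 1024 * 1024  # 100MB, as in the text


def split_parts(data: bytes, part_size: int = PART_SIZE) -> Dict[int, bytes]:
    """Cut data into numbered parts (1-based, in order)."""
    return {i // part_size + 1: data[i:i + part_size]
            for i in range(0, len(data), part_size)}


class MultipartUpload:
    """Server-side view of one upload: parts are kept separately until completion."""

    def __init__(self, total_parts: int) -> None:
        self.total_parts = total_parts
        self.received: Dict[int, bytes] = {}

    def upload_part(self, number: int, body: bytes) -> None:
        self.received[number] = body  # re-uploading a part overwrites it

    def missing_parts(self) -> List[int]:
        return [n for n in range(1, self.total_parts + 1) if n not in self.received]

    def complete(self) -> bytes:
        """Assemble the object, refusing if any part is still missing."""
        missing = self.missing_parts()
        if missing:
            raise ValueError(f"missing parts: {missing}")
        return b"".join(self.received[n] for n in range(1, self.total_parts + 1))


data = bytes(range(10))
parts = split_parts(data, part_size=4)   # tiny parts for the demo
upload = MultipartUpload(total_parts=len(parts))
upload.upload_part(1, parts[1])
upload.upload_part(3, parts[3])          # part 2 was lost in transit
for number in upload.missing_parts():    # resume: resend only what is missing
    upload.upload_part(number, parts[number])
print(upload.complete() == data)
```

Since each part is independent of the others, a client is free to send them concurrently and in any order.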

§3 Step 4: Wrap Up

Decision	Choice	Why
Worker Services for Async Processing	Upload processing workers are server-side components that perform CPU-intensive operations on stored files asynchronously after the upload completes.	Processing uploads synchronously blocks the upload API thread, preventing it from handling new uploads.
Postgres for File Metadata	Postgres stores the searchable file metadata catalog containing file names, sizes, content types (MIME types), owner IDs, access control lists (who can read/write/delete), version history with timestamps, and the S3 content key mapping to actual file bytes stored in erasure-coded chunks.	Object storage services like S3 have no queryable metadata beyond the object key itself.
CDN for File Delivery	A content delivery network (CDN) caches file content at geographically distributed edge locations (points of presence) around the world.	S3 is deployed in a single geographic region for administrative simplicity.

Key design decisions

If the interviewer asks to scale 10×: use horizontal sharding. Partition by a high-cardinality key (user_id or a content hash) to spread writes across nodes.
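
One way to sketch hash-based shard selection, assuming a fixed shard count and string keys. The `shard_for` function is illustrative only. It uses SHA-256 rather than Python's built-in `hash()`, which is salted per process for strings and so would not give stable placement.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 16
placement = Counter(shard_for(f"user-{i}", NUM_SHARDS) for i in range(10_000))
print(len(placement))                                  # shards that received keys
print(min(placement.values()), max(placement.values()))  # spread across shards
```

With plain modulo placement, changing the shard count remaps most keys, which is why consistent hashing is a common alternative when shards are added often.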

10× target: 100K RPS, where your architecture must hold
