Object Storage System
§1 Step 2: High-Level Design
Design S3-compatible storage. Multipart uploads, erasure coding, and cross-region replication.
Add workers that handle slow upload processing tasks: virus scanning, image resizing, metadata extraction.
Upload processing workers are server-side components that perform CPU-intensive operations on stored files asynchronously after the upload completes. These workers handle virus scanning to detect malware, format conversion to create multiple media versions, image and video transcoding to optimize for different devices and bandwidths, EXIF metadata extraction from images, and thumbnail generation for preview interfaces. Each worker is a stateless process that consumes upload completion events from a message queue, processes the file, and writes results back to the metadata database.
Processing uploads synchronously blocks the upload API thread, preventing it from handling new uploads. A virus scan might take 5-30 seconds. Video transcoding can take minutes. This serialization creates a bottleneck: clients wait for processing before getting a successful response, and the API can't accept new uploads. Workers decouple slow operations from the fast upload response path. The client gets an immediate 200 OK with an upload ID, then polls or receives webhooks for processing status.
Async processing means the uploaded file isn't immediately available in its final form. Users see an 'uploading' state (while parts are assembling), then a 'processing' state (while workers transcode/scan), then 'ready' when all async work completes. This typically adds 5-60 seconds before a file is fully usable, and longer for video transcoding, which can take minutes. For most applications this is acceptable because users expect processing time. However, for time-critical workflows (e.g., security incident response needing immediate threat analysis), synchronous processing may be required.
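The fast upload path and the slow worker path can be sketched roughly as follows. This is a minimal in-memory illustration, not a production design: `events`, `metadata`, `complete_upload`, `scan_for_malware`, and `run_worker_once` are hypothetical names, the scanner is a placeholder, and the `failed` state is an assumption added for files that do not pass the scan.

```python
import queue
from enum import Enum


class Status(str, Enum):
    UPLOADING = "uploading"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"  # assumption: not one of the states named in the text


# Hypothetical in-memory stand-ins for the message queue and metadata database.
events = queue.Queue()
metadata = {}


def complete_upload(upload_id: str, data: bytes) -> dict:
    """Fast path: record the upload, enqueue an event, return immediately."""
    metadata[upload_id] = {"status": Status.PROCESSING, "data": data, "results": {}}
    events.put(upload_id)
    return {"upload_id": upload_id, "status": Status.PROCESSING.value}


def scan_for_malware(data: bytes) -> bool:
    """Placeholder scanner: flags a made-up marker, not a real signature check."""
    return b"MALWARE" not in data


def run_worker_once() -> None:
    """Slow path: consume one event, process the file, write results back."""
    upload_id = events.get()
    record = metadata[upload_id]
    try:
        clean = scan_for_malware(record["data"])
        record["results"]["clean"] = clean
        record["status"] = Status.READY if clean else Status.FAILED
    finally:
        events.task_done()


response = complete_upload("u1", b"hello world")  # returns before any scanning
run_worker_once()                                  # later, on a worker process
final_status = metadata["u1"]["status"]
```

The client only ever sees the response from `complete_upload`; it learns about `ready` by polling the metadata record or through a webhook.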
AWS S3 integrates with Lambda for post-upload processing — trigger functions run automatically on object creation. Dropbox uses a worker fleet for preview generation; all files get thumbnails within minutes of upload. Box uses workers for OCR and content extraction, enabling full-text search on document uploads. Google Drive uses workers to convert Office files to Google Sheets/Docs format.
If uploads average 5MB and each worker processes 50MB/second, one worker handles 10 uploads/second. For a peak of 1K uploads/second, you need 100 worker instances. Workers should auto-scale based on queue depth: when the queue hits 10K pending items, trigger 50 more workers, which together clear about 500 items/second and drain that backlog in roughly 20 seconds.
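The sizing arithmetic above can be checked with a small helper. The function names are hypothetical, and the drain estimate assumes the existing fleet already keeps up with new arrivals so the extra workers only handle the backlog.

```python
import math


def workers_needed(peak_uploads_per_s: float, avg_upload_mb: float,
                   worker_mb_per_s: float) -> int:
    """Workers required to keep up with the peak upload rate."""
    uploads_per_worker = worker_mb_per_s / avg_upload_mb
    return math.ceil(peak_uploads_per_s / uploads_per_worker)


def backlog_drain_seconds(backlog_items: int, extra_workers: int,
                          avg_upload_mb: float, worker_mb_per_s: float) -> float:
    """Time for the extra workers alone to clear a queue backlog."""
    uploads_per_worker = worker_mb_per_s / avg_upload_mb
    return backlog_items / (extra_workers * uploads_per_worker)


fleet = workers_needed(1_000, avg_upload_mb=5, worker_mb_per_s=50)
drain = backlog_drain_seconds(10_000, extra_workers=50,
                              avg_upload_mb=5, worker_mb_per_s=50)
print(fleet, drain)
```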
Store file metadata (name, size, type, owner, permissions, versions) in Postgres for fast querying and access control.
Postgres stores the searchable file metadata catalog containing file names, sizes, content types (MIME types), owner IDs, access control lists (who can read/write/delete), version history with timestamps, and the S3 content key mapping to actual file bytes stored in erasure-coded chunks. The schema includes indexes on owner_id for efficient per-user queries, file names for search, and created_at for sorting. The s3_key field stores the SHA-256 hash or storage location pointer that directs the API to the correct erasure-coded chunks.
Object storage services like S3 have no queryable metadata beyond the object key itself. To answer user queries like 'show me all files owned by user X larger than 10MB created in the last week', you need a relational database. S3 stores only the raw bytes; Postgres stores the structured, searchable metadata. The API splits responsibility: S3 is the content repository (durability, large-scale storage), Postgres is the metadata index (queryability, access control, versioning).
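As a rough illustration of the kind of query the metadata catalog answers, the sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres. The table and column names follow the fields mentioned above but are otherwise assumptions, as are the sample rows and the choice to store `created_at` as Unix seconds.

```python
import sqlite3
import time

MB = 1024 * 1024
DAY = 24 * 60 * 60

conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
conn.executescript("""
CREATE TABLE files (
    id           INTEGER PRIMARY KEY,
    owner_id     INTEGER NOT NULL,
    name         TEXT    NOT NULL,
    size_bytes   INTEGER NOT NULL,
    content_type TEXT    NOT NULL,
    s3_key       TEXT    NOT NULL,   -- pointer to the bytes in object storage
    created_at   INTEGER NOT NULL    -- Unix seconds (assumption for this sketch)
);
CREATE INDEX idx_files_owner   ON files (owner_id);
CREATE INDEX idx_files_name    ON files (name);
CREATE INDEX idx_files_created ON files (created_at);
""")

now = int(time.time())
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 42, "video.mp4", 50 * MB, "video/mp4", "ab/abc123", now - 2 * DAY),
        (2, 42, "notes.txt", 2_000, "text/plain", "cd/cde456", now - 1 * DAY),
        (3, 42, "old.iso", 700 * MB, "application/octet-stream", "ef/efa789", now - 30 * DAY),
        (4, 7, "other.mp4", 80 * MB, "video/mp4", "01/012abc", now - 1 * DAY),
    ],
)


def large_recent_files(owner_id: int, min_bytes: int, since_ts: int) -> list:
    """Files owned by one user, above a size threshold, created since a timestamp."""
    cur = conn.execute(
        "SELECT name FROM files "
        "WHERE owner_id = ? AND size_bytes > ? AND created_at >= ? "
        "ORDER BY created_at DESC",
        (owner_id, min_bytes, since_ts),
    )
    return [row[0] for row in cur]


result = large_recent_files(42, 10 * MB, now - 7 * DAY)
```

Object storage alone cannot answer this query, because it would require listing every key and inspecting each object.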
Metadata and content can drift out of sync if writes aren't atomic. If you write metadata to Postgres first, then attempt to upload to S3 and fail, you have orphaned metadata pointing to non-existent content. Conversely, if you upload to S3 first and Postgres crashes before persisting metadata, the content is orphaned: it exists in S3 but nothing references it, so it cannot be retrieved. Use the saga pattern or a transactional outbox: write metadata to Postgres first, then upload to S3. On S3 failure, roll back the Postgres record. This keeps the metadata-content mapping consistent as long as the rollback itself runs, so a crash between the two steps still needs a cleanup pass for stale records; the cost is additional logic.
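A minimal sketch of the metadata-first write with a compensating rollback might look like this. Dictionaries stand in for Postgres and S3, and all names here are hypothetical. The `pending`/`committed` states are an assumption added so that readers never see a record whose content has not landed yet.

```python
class StorageError(Exception):
    """Raised by the fake object store to simulate an S3 failure."""


metadata_db = {}   # stand-in for the Postgres metadata table
object_store = {}  # stand-in for S3


def put_object(key: str, data: bytes, fail: bool = False) -> None:
    if fail:
        raise StorageError("simulated S3 failure")
    object_store[key] = data


def save_file(file_id: str, key: str, data: bytes, *,
              simulate_failure: bool = False) -> bool:
    # Step 1: write metadata first, marked as not yet visible.
    metadata_db[file_id] = {"s3_key": key, "state": "pending"}
    try:
        # Step 2: upload the content.
        put_object(key, data, fail=simulate_failure)
    except StorageError:
        # Compensating action: undo step 1 so no record points at missing bytes.
        del metadata_db[file_id]
        return False
    # Step 3: content is durable, so expose the record.
    metadata_db[file_id]["state"] = "committed"
    return True


def visible_files() -> dict:
    """Only committed records are shown to readers."""
    return {fid: m for fid, m in metadata_db.items() if m["state"] == "committed"}


ok = save_file("f1", "k1", b"hello")
failed = save_file("f2", "k2", b"world", simulate_failure=True)
```

If the process crashes between steps 1 and 3, a `pending` record remains; a periodic job that deletes old pending records would be one way to clean those up.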
Dropbox used MySQL for file metadata with millions of files per user (and later migrated to custom stores). Box uses MySQL for the same metadata catalog. GitHub uses Postgres for repository metadata, pull request history, and issue tracking: Git stores content blobs, Postgres stores the structured data. All major cloud storage providers separate metadata from content storage.
At 1B files × 500 bytes per metadata record, Postgres stores about 500GB. For fast access, partition by owner_id so each user's files are queried from a single partition. Postgres only stores the mapping to the S3 key; fetching the bytes is then a direct key lookup in object storage. At 10K uploads/second, Postgres can handle the metadata write throughput with proper indexing and connection pooling on a 16-core instance with 64GB RAM.
Add a CDN in front of S3 to serve files with low latency globally and reduce S3 egress costs.
A content delivery network (CDN) caches file content at geographically distributed edge locations (points of presence) around the world. Instead of all downloads originating from S3 in a single region (e.g., us-east-1), the CDN intercepts requests and serves cached copies from the nearest PoP. If the file is already cached at a Tokyo PoP, a Japanese user downloads from Tokyo (20ms latency) instead of fetching from us-east-1 (150ms latency). The CDN automatically fetches the file from origin once on first request, then serves all subsequent requests from cache until the TTL expires.
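The fetch-once-then-serve-until-TTL behavior can be sketched as a small cache class. This is an illustration of a single PoP only, with hypothetical names (`EdgeCache`, `fetch_origin`); a real CDN adds eviction, revalidation, and cache-control handling that are omitted here.

```python
import time
from typing import Callable


class EdgeCache:
    """One PoP: serve from cache until the TTL expires, otherwise go to origin."""

    def __init__(self, fetch_origin: Callable[[str], bytes], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch_origin
        self._ttl = ttl_s
        self._clock = clock
        self._entries = {}          # key -> (expires_at, body)
        self.origin_fetches = 0

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # cache hit: no origin traffic
        body = self._fetch(key)     # miss or expired: fetch from origin once
        self.origin_fetches += 1
        self._entries[key] = (now + self._ttl, body)
        return body


# A fake clock makes the TTL behavior easy to see without sleeping.
fake_now = [0.0]
cache = EdgeCache(lambda key: b"bytes:" + key.encode(), ttl_s=60,
                  clock=lambda: fake_now[0])
first = cache.get("photo.jpg")
second = cache.get("photo.jpg")
fetches_before_expiry = cache.origin_fetches
fake_now[0] = 61.0
third = cache.get("photo.jpg")
fetches_after_expiry = cache.origin_fetches
```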
An S3 bucket lives in a single geographic region, and this design keeps one origin region for administrative simplicity. A user in Tokyo downloading a 100MB file from S3 in us-east-1 sees about 150ms of network latency plus transfer time: 100MB is 800Mb, so at 800Mbps the transfer takes about 1 second at line rate, and usually longer in practice because TCP throughput falls as round-trip time grows. With CDN edge caching, the same download starts after about 20ms and runs over a much shorter path. More importantly, S3 egress (outbound data transfer) costs about $0.09/GB, while CDN egress costs $0.01-0.03/GB depending on the provider. For a service delivering 100GB/day globally, traffic served from the edge is 3-9× cheaper per GB.
CDN caching works seamlessly for public files or files accessed with presigned URLs with long TTLs (where the URL itself encodes the access token). Private files requiring authentication on every request cannot be efficiently cached — the CDN would need to authenticate each request, defeating the purpose. For private files, use presigned URLs with 1-hour TTLs; clients must refresh the URL when it expires. Alternatively, use signed cookies set server-side.
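The idea that a URL can carry its own expiring access token can be illustrated with a simple HMAC signature. This is a generic sketch, not the AWS Signature Version 4 scheme that real S3 presigned URLs use; `SECRET`, `sign_url`, `verify_url`, and the `expires`/`sig` query parameters are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key shared by API and edge


def _signature(path: str, expires: int) -> str:
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_s: int, now: Optional[int] = None) -> str:
    """Return the path with an expiry timestamp and a signature over both."""
    issued = int(time.time()) if now is None else now
    expires = issued + ttl_s
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url: str, now: Optional[int] = None) -> bool:
    """Accept the URL only if it is unexpired and the signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = int(time.time()) if now is None else now
    if current > expires:
        return False
    return hmac.compare_digest(sig, _signature(parsed.path, expires))


url = sign_url("/files/report.pdf", ttl_s=3600, now=1_000)
valid_now = verify_url(url, now=2_000)
valid_later = verify_url(url, now=5_000)
tampered_ok = verify_url(url.replace("report.pdf", "secret.pdf"), now=2_000)
```

Because verification needs only the shared key and the URL itself, an edge node can check access without calling back to the API on every request.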
Dropbox uses Cloudflare CDN for publicly shared files and presigned URLs for private downloads. Box uses Akamai for file delivery. GitHub uses Fastly for repository file downloads and release artifacts. AWS S3 + CloudFront is the canonical pattern: S3 stores content, CloudFront caches and distributes it globally and routes each user to a nearby PoP, and Route 53 maps the service's domain name to the distribution.
With a cache hit rate of 90% for popular files, 90% of bandwidth comes from edge servers at $0.01/GB and the remaining 10% from S3 at $0.09/GB, for a blended cost of about $0.018/GB, roughly 5× cheaper than serving everything from S3. For a service delivering 1PB/month globally, that is about $18K instead of $90K, saving ~$72K/month. CDN PoPs typically cache 100-1000GB each depending on the provider; popular files stay cached as long as they keep being requested, while less popular files are evicted on an LRU basis.
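The egress estimate can be reproduced with a few lines of arithmetic. The model below is a simplification and an assumption of this sketch: cache hits are billed at the CDN rate and misses at the S3 rate, using the per-GB prices quoted in this section, with 1PB taken as 1,000,000GB.

```python
def monthly_egress_cost(gb_per_month: float, hit_rate: float,
                        cdn_per_gb: float = 0.01,
                        s3_per_gb: float = 0.09) -> float:
    """Blended monthly egress cost: hits at the CDN price, misses at the S3 price."""
    hit_cost = gb_per_month * hit_rate * cdn_per_gb
    miss_cost = gb_per_month * (1 - hit_rate) * s3_per_gb
    return hit_cost + miss_cost


PB_IN_GB = 1_000_000
direct = monthly_egress_cost(PB_IN_GB, hit_rate=0.0)    # everything from S3
with_cdn = monthly_egress_cost(PB_IN_GB, hit_rate=0.9)  # 90% served at the edge
savings = direct - with_cdn
ratio = direct / with_cdn

print(f"direct=${direct:,.0f} with_cdn=${with_cdn:,.0f} "
      f"savings=${savings:,.0f} ratio={ratio:.1f}x")
```

The savings are sensitive to the hit rate: at a 50% hit rate the same model gives a blended cost of $0.05/GB, which is less than 2× cheaper than S3 alone.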
§2 Step 3: Deep Dive
| Technique | Storage overhead | Durability | Read cost | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| 3× replication | 200% | 11 nines | Read any copy | Simple, low-latency reads | High | Low |
| Erasure coding (4+2) | 50% | 11+ nines | Reconstruct from 4/6 | Large objects, cost-sensitive ✓ | Medium | Low |
| Erasure coding (6+3) | 50% | 12+ nines | Reconstruct from 6/9 | Maximum durability (AWS S3) | Medium | Low |
| Reed-Solomon (8+4) | 50% | Very high | Reconstruct from 8/12 | Archival, cold storage | Low | Low |
| LRC (12+2+2) | 33% | Very high | Local repair cheap | Azure Storage, warm storage | Low | Medium |
Object storage durability techniques — erasure coding beats replication for cost.
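The storage overhead column follows directly from the ratio of redundant to data fragments. The helpers below are a small sketch with hypothetical names. The reconstruction check applies to replication and to maximum-distance-separable codes such as Reed-Solomon, where any `data` fragments suffice; it does not model LRC, whose failure tolerance depends on which fragments are lost.

```python
def storage_overhead(data: int, parity: int) -> float:
    """Extra storage as a fraction of the original object size."""
    return parity / data


def can_reconstruct(data: int, parity: int, available: int) -> bool:
    """For replication and MDS codes, any `data` surviving fragments are enough."""
    if not 0 <= available <= data + parity:
        raise ValueError("available must be between 0 and data + parity")
    return available >= data


schemes = {
    "3x replication": (1, 2),   # one data copy plus two redundant copies
    "EC 4+2": (4, 2),
    "EC 6+3": (6, 3),
    "RS 8+4": (8, 4),
    "LRC 12+2+2": (12, 4),      # overhead only; see the note above
}
overheads = {name: round(storage_overhead(d, p) * 100) for name, (d, p) in schemes.items()}

for name, pct in overheads.items():
    print(f"{name}: {pct}% overhead")
```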
    import hashlib
    import os
    from pathlib import Path

    CHUNK_SIZE = 4 * 1024 * 1024  # 4MB
    STORAGE_ROOT = Path("/data/chunks")


    def sha256(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()


    def upload_object(file_path: str) -> dict:
        """Split a file into chunks, store each under its hash, return a manifest."""
        chunk_ids = []
        with open(file_path, 'rb') as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                chunk_hash = sha256(chunk)
                chunk_ids.append(chunk_hash)
                # Shard by the first two hex characters to keep directories small.
                dest = STORAGE_ROOT / chunk_hash[:2] / chunk_hash
                if not dest.exists():
                    dest.parent.mkdir(parents=True, exist_ok=True)
                    dest.write_bytes(chunk)  # same content = stored once
        # The object ID is the hash of the ordered chunk hashes.
        object_hash = sha256(''.join(chunk_ids).encode())
        return {'object_id': object_hash, 'chunks': chunk_ids,
                'size': os.path.getsize(file_path)}


    def download_object(manifest: dict, dest_path: str):
        """Reassemble an object by concatenating its chunks in manifest order."""
        with open(dest_path, 'wb') as out:
            for chunk_hash in manifest['chunks']:
                out.write((STORAGE_ROOT / chunk_hash[:2] / chunk_hash).read_bytes())

| Component | Why Add It | Tradeoff |
|---|---|---|
| Worker Services for Async Processing | Processing uploads synchronously blocks the upload API thread, preventing it from handling new uploads. | Async processing means the uploaded file isn't immediately available in its final form. |
| Postgres for File Metadata | Object storage services like S3 have no queryable metadata beyond the object key itself. | Metadata and content can drift out of sync if writes aren't atomic. |
| CDN for File Delivery | Content served from a single S3 region is far from most users; edge caching lowers latency and egress cost. | Caching works well for public files and long-TTL presigned URLs, but private files that require authentication on every request cannot be cached efficiently. |
Design decision tradeoffs
One storage node holding one of an object's 12 erasure-coded chunks fails. With a scheme such as 8+4, any 8 of the remaining 11 chunks can still reconstruct the object, so the system continues serving with no data loss.
A CDN PoP in a region goes offline after a popular file expires from cache. All concurrent requests hammer the origin S3 for the same file, and the origin sees a 100× request spike.
A video upload event (e.g., a live stream archive) causes 10K concurrent multipart uploads. The API scales workers, but the temporary spike in pending parts floods Postgres.
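One common way to soften the origin spike in the cache-expiry scenario is request coalescing, where concurrent requests for the same key share a single origin fetch. The source does not prescribe this mitigation; the sketch below is an assumption-laden illustration with hypothetical names, and it never expires results, which a real cache would.

```python
import threading
import time


class SingleFlight:
    """Coalesce concurrent requests for one key into a single origin fetch."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event set when the leader finishes
        self._results = {}    # key -> cached body (no expiry in this sketch)

    def get(self, key: str) -> bytes:
        while True:
            with self._lock:
                if key in self._results:
                    return self._results[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    event = self._inflight[key] = threading.Event()
            if not leader:
                event.wait()
                continue  # re-check the cache; retry if the leader's fetch failed
            try:
                value = self._fetch(key)
                with self._lock:
                    self._results[key] = value
                return value
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()


origin_calls = []


def slow_origin_fetch(key: str) -> bytes:
    time.sleep(0.05)          # simulate a slow origin read
    origin_calls.append(key)
    return b"video-bytes"


flight = SingleFlight(slow_origin_fetch)
responses = []
threads = [threading.Thread(target=lambda: responses.append(flight.get("hot.mp4")))
           for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With twenty concurrent requests for the same file, the origin is read once and every caller receives the same bytes.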
§3 Step 4: Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Worker Services for Async Processing | Upload processing workers are server-side components that perform CPU-intensive operations on stored files asynchronously after the upload completes. | Processing uploads synchronously blocks the upload API thread, preventing it from handling new uploads. |
| Postgres for File Metadata | Postgres stores the searchable file metadata catalog containing file names, sizes, content types (MIME types), owner IDs, access control lists (who can read/write/delete), version history with timestamps, and the S3 content key mapping to actual file bytes stored in erasure-coded chunks. | Object storage services like S3 have no queryable metadata beyond the object key itself. |
| CDN for File Delivery | A content delivery network (CDN) caches file content at geographically distributed edge locations (points of presence) around the world. | Content served from a single S3 region is far from most users; edge caching lowers latency and egress cost. |
Key design decisions