Open on desktop
Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.
Web Crawler
§1Step 2 — High-Level Design
Build a distributed crawler that indexes 1B pages. Handle politeness, deduplication, and URL frontier.
Add crawler workers that fetch HTML from URLs in the queue and extract new links for further crawling.
Crawler nodes are worker processes that fetch web pages via HTTP, parse HTML to extract links, and discover new URLs to crawl.
Web crawling is embarrassingly parallel — fetching URL A doesn't block fetching URL B. Multiple crawlers increase throughput proportionally. Each crawler handles 10-100 pages/second depending on network and target server speed.
Crawlers must respect robots.txt and rate limits to avoid being blocked. Too aggressive = IP bans. Too slow = stale index. Most search engines use politeness delays of 1-5 seconds per domain.
Googlebot runs millions of crawlers globally. CommonCrawl (the open web archive) uses distributed Nutch crawlers. Ahrefs maintains a 6B-page crawl index.
One crawler at 10 pages/second fetches 864K pages/day. For a 100B page index, you need 10K crawlers running continuously.
Add a message queue to buffer discovered URLs, forming the crawl frontier that distributes work to crawler nodes.
The URL frontier is a message queue (Kafka) that stores discovered but not-yet-crawled URLs. Crawlers consume URLs from it and publish newly discovered URLs back to it.
The frontier decouples URL discovery from URL fetching. It enables priority queuing (crawl news sites more often than static pages), deduplication, and politeness scheduling without blocking crawlers.
The frontier can grow to trillions of URLs (the web has ~50B pages). Kafka's log compaction and partitioning by domain host enables efficient priority queuing at this scale.
Google's crawl frontier is distributed across data centers. Apache Nutch uses Kafka as its frontier. Scrapy-Redis uses Redis lists as a distributed crawl queue.
Kafka handles 1M URLs/second throughput. At 100 bytes per URL, 1B queued URLs = 100GB — manageable with Kafka's disk-based storage.
Add a Redis Bloom Filter or set to check whether a URL has already been crawled before enqueuing it.
Redis stores the set of visited URLs (using a Bloom Filter for memory efficiency) so crawlers can quickly check if a URL has already been fetched.
Without deduplication, the same URL gets crawled thousands of times (many sites link to each other). Redis deduplication ensures each URL is crawled once, maximizing the fraction of unique pages discovered per unit of work.
Bloom Filters have a small false positive rate — some new URLs are incorrectly flagged as seen and skipped. This is acceptable: a few missed pages is better than re-crawling 90% of the web.
Google uses multiple deduplication layers. CommonCrawl uses URL normalization + hash-based deduplication. Screaming Frog (SEO crawler) uses in-memory hash sets for visited URLs.
100B URL fingerprints (SHA-256 truncated to 64 bits) in a Redis Bloom Filter requires ~150GB RAM with 1% false positive rate, or 1.5TB for 0.01%.
Store the raw HTML and extracted content of crawled pages in object storage (S3) for indexing and future reprocessing.
Object storage (S3/GCS) stores the raw crawled content (HTML, metadata) durably and cheaply, making it available for the downstream indexing pipeline.
Crawling is expensive — you don't want to re-crawl just to reprocess with an improved indexing algorithm. Storing raw content lets you replay and reindex from the same crawl data indefinitely.
Object storage is eventually consistent on some providers. S3 charges per request ($0.004/10K). At 100B pages × 50KB/page, storage cost is ~$2.3M/month for raw HTML.
CommonCrawl stores 2.7 billion web pages of raw HTML in S3 (publicly available, ~300TB per crawl). Google Wayback Machine uses similar cold storage. Ahrefs stores crawl data in distributed object stores.
S3 has essentially unlimited capacity. At $0.023/GB/month, storing 1PB of crawl data costs $23K/month. Lifecycle policies compress and tier old data to Glacier ($0.004/GB/month).
At high crawl volume, use a message queue to distribute URLs to worker pools instead of direct assignment.
A message queue (Kafka or SQS) acts as the URL frontier, buffering discovered URLs and distributing them to crawler workers.
At high volume, URL discovery is faster than crawling. A queue decouples discovery from fetching — the queue absorbs the burst.
Duplicate URLs will appear in the queue. Use a distributed bloom filter to deduplicate before enqueuing.
Googlebot uses a massive distributed URL frontier. Scrapy with Redis (scrapy-redis) implements this for Python crawlers.
Kafka handles 1M+ events/second. A 3-broker cluster can buffer billions of pending URLs.
At peak crawl demand, scale worker nodes horizontally behind a load balancer to maximize concurrent fetch throughput.
A load balancer (or Kafka consumer group) distributes URL processing across a horizontally scaled worker pool.
At peak, crawl throughput is limited by worker count. Adding nodes scales fetch throughput linearly up to the queue partition count.
Politeness constraints (don't crawl one domain too fast) require per-domain rate limiting across all workers, not just per-worker.
Common Crawl runs thousands of crawler nodes on AWS Spot Instances. Each node is stateless and pulls from a shared queue.
Each worker fetches 50-200 pages/second. 100 workers = 5K-20K pages/second, sufficient for broad crawls.
§2Step 3 — Deep Dive
Crawler nodes are worker processes that fetch web pages via HTTP, parse HTML to extract links, and discover new URLs to crawl.
| Storage | Throughput | Persistence | Priority support | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Redis sorted set | 100K ops/sec | Optional | Yes (score = priority) | Hot frontier, recent URLs ✓ | Medium | Medium |
| Kafka topic | 1M+ msg/sec | Yes (log) | Partial (priority topics) | Large-scale, distributed crawl | Medium | Medium |
| Postgres queue | 10K/sec | Yes | Yes (ORDER BY priority) | Small crawl, exact dedup needed | Medium | Low |
| RabbitMQ | 50K msg/sec | Yes | Yes (priority queues) | Complex routing, multiple crawlers | Medium | Medium |
| In-memory (single node) | >1M ops/sec | No | Yes | Prototype, single-crawler | Low | Low |
URL frontier storage — Redis for hot queue, DB for durable frontier.
import time
import redis
import requests
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
r = redis.Redis(decode_responses=True)
CRAWL_DELAY = 1.0 # seconds between requests to same domain
def can_crawl(url: str) -> bool:
"""Check robots.txt compliance."""
domain = urlparse(url).netloc
rp_key = f"robots:{domain}"
robots_txt = r.get(rp_key)
if not robots_txt:
try:
resp = requests.get(f"https://{domain}/robots.txt", timeout=5)
robots_txt = resp.text
r.setex(rp_key, 86400, robots_txt) # cache 24h
except Exception:
return True # assume crawlable if robots.txt unreachable
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
return rp.can_fetch("*", url)
def crawl(url: str) -> str | None:
domain = urlparse(url).netloc
# Politeness: one request per second per domain (distributed across fleet)
last_key = f"crawl:last:{domain}"
last_crawl = float(r.get(last_key) or 0)
wait = CRAWL_DELAY - (time.time() - last_crawl)
if wait > 0:
time.sleep(wait)
if not can_crawl(url):
return None
resp = requests.get(url, timeout=10, headers={'User-Agent': 'MyCrawler/1.0'})
r.set(last_key, time.time())
return resp.text| Component | Why Add It | Tradeoff |
|---|---|---|
| Crawler Nodes | Web crawling is embarrassingly parallel — fetching URL A doesn't block fetching URL B. | Crawlers must respect robots. |
| Message Queue for URL Frontier | The frontier decouples URL discovery from URL fetching. | The frontier can grow to trillions of URLs (the web has ~50B pages). |
| Redis for URL Deduplication | Without deduplication, the same URL gets crawled thousands of times (many sites link to each other). | Bloom Filters have a small false positive rate — some new URLs are incorrectly flagged as seen and skipped. |
| Object Storage for Crawled Content | Crawling is expensive — you don't want to re-crawl just to reprocess with an improved indexing algorithm. | Object storage is eventually consistent on some providers. |
| Message Queue for URL Distribution | At high volume, URL discovery is faster than crawling. | Duplicate URLs will appear in the queue. |
| Load Balancer for Worker Pool | At peak, crawl throughput is limited by worker count. | Politeness constraints (don't crawl one domain too fast) require per-domain rate limiting across all workers, not just per-worker. |
Design decision tradeoffs
crawler-1 crashes with 10K URLs in-flight. Without checkpointing, those URLs are lost from the frontier. How do you implement exactly-once URL processing using the message queue and idempotency keys?
Crawling a highly interconnected site causes exponential URL discovery — each page links to 100 new pages. The URL frontier grows to 1B entries, exhausting memory. How do you implement URL deduplication with Bloom filters and prioritization to bound frontier size?
All 10 crawlers hammer the same domain simultaneously, violating robots.txt and getting IP-banned. How do you implement per-domain rate limiting, crawl delay queues, and distributed polite crawling?
§3Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Crawler Nodes | Crawler nodes are worker processes that fetch web pages via HTTP, parse HTML to extract links, and discover new URLs to crawl. | Web crawling is embarrassingly parallel — fetching URL A doesn't block fetching URL B. |
| Message Queue for URL Frontier | The URL frontier is a message queue (Kafka) that stores discovered but not-yet-crawled URLs. | The frontier decouples URL discovery from URL fetching. |
| Redis for URL Deduplication | Redis stores the set of visited URLs (using a Bloom Filter for memory efficiency) so crawlers can quickly check if a URL has already been fetched. | Without deduplication, the same URL gets crawled thousands of times (many sites link to each other). |
| Object Storage for Crawled Content | Object storage (S3/GCS) stores the raw crawled content (HTML, metadata) durably and cheaply, making it available for the downstream indexing pipeline. | Crawling is expensive — you don't want to re-crawl just to reprocess with an improved indexing algorithm. |
| Message Queue for URL Distribution | A message queue (Kafka or SQS) acts as the URL frontier, buffering discovered URLs and distributing them to crawler workers. | At high volume, URL discovery is faster than crawling. |
| Load Balancer for Worker Pool | A load balancer (or Kafka consumer group) distributes URL processing across a horizontally scaled worker pool. | At peak, crawl throughput is limited by worker count. |
Key design decisions