Open on desktop

Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.

Best on desktop

Back to lesson

Web Crawler

beginnerCrawlingDistribution

Data·40 min read

Web Crawler

1Understand the Problem & Establish Design Scope→2High-Level Design→3Deep Dive→4Wrap Up

CrawlingDistribution

§1Step 2 — High-Level Design

2High-Level Design

Build a distributed crawler that indexes 1B pages. Handle politeness, deduplication, and URL frontier.

System architecture overview

Stage 1 of 7Starting state — the problem to solve

Progressive build — add each component step by step

Add Crawler Nodes

Add crawler workers that fetch HTML from URLs in the queue and extract new links for further crawling.

What it does

Crawler nodes are worker processes that fetch web pages via HTTP, parse HTML to extract links, and discover new URLs to crawl.

Why it matters

Web crawling is embarrassingly parallel — fetching URL A doesn't block fetching URL B. Multiple crawlers increase throughput proportionally. Each crawler handles 10-100 pages/second depending on network and target server speed.

Trade-off

Crawlers must respect robots.txt and rate limits to avoid being blocked. Too aggressive = IP bans. Too slow = stale index. Most search engines use politeness delays of 1-5 seconds per domain.

Real world

Googlebot runs millions of crawlers globally. CommonCrawl (the open web archive) uses distributed Nutch crawlers. Ahrefs maintains a 6B-page crawl index.

Capacity math

One crawler at 10 pages/second fetches 864K pages/day. For a 100B page index, you need 10K crawlers running continuously.

In the real world: Googlebot runs millions of crawlers globally. CommonCrawl (the open web archive) uses distributed Nutch crawlers. Ahrefs maintains a 6B-page crawl index.

Add a Message Queue for URL Frontier

Add a message queue to buffer discovered URLs, forming the crawl frontier that distributes work to crawler nodes.

What it does

The URL frontier is a message queue (Kafka) that stores discovered but not-yet-crawled URLs. Crawlers consume URLs from it and publish newly discovered URLs back to it.

Why it matters

The frontier decouples URL discovery from URL fetching. It enables priority queuing (crawl news sites more often than static pages), deduplication, and politeness scheduling without blocking crawlers.

Trade-off

The frontier can grow to trillions of URLs (the web has ~50B pages). Kafka's log compaction and partitioning by domain host enables efficient priority queuing at this scale.

Real world

Google's crawl frontier is distributed across data centers. Apache Nutch uses Kafka as its frontier. Scrapy-Redis uses Redis lists as a distributed crawl queue.

Capacity math

Kafka handles 1M URLs/second throughput. At 100 bytes per URL, 1B queued URLs = 100GB — manageable with Kafka's disk-based storage.

In the real world: Google's crawl frontier is distributed across data centers. Apache Nutch uses Kafka as its frontier. Scrapy-Redis uses Redis lists as a distributed crawl queue.

Add Redis for URL Deduplication

Add a Redis Bloom Filter or set to check whether a URL has already been crawled before enqueuing it.

What it does

Redis stores the set of visited URLs (using a Bloom Filter for memory efficiency) so crawlers can quickly check if a URL has already been fetched.

Why it matters

Without deduplication, the same URL gets crawled thousands of times (many sites link to each other). Redis deduplication ensures each URL is crawled once, maximizing the fraction of unique pages discovered per unit of work.

Trade-off

Bloom Filters have a small false positive rate — some new URLs are incorrectly flagged as seen and skipped. This is acceptable: a few missed pages is better than re-crawling 90% of the web.

Real world

Google uses multiple deduplication layers. CommonCrawl uses URL normalization + hash-based deduplication. Screaming Frog (SEO crawler) uses in-memory hash sets for visited URLs.

Capacity math

100B URL fingerprints (SHA-256 truncated to 64 bits) in a Redis Bloom Filter requires ~150GB RAM with 1% false positive rate, or 1.5TB for 0.01%.

In the real world: Google uses multiple deduplication layers. CommonCrawl uses URL normalization + hash-based deduplication. Screaming Frog (SEO crawler) uses in-memory hash sets for visited URLs.

Add Object Storage for Crawled Content

Store the raw HTML and extracted content of crawled pages in object storage (S3) for indexing and future reprocessing.

What it does

Object storage (S3/GCS) stores the raw crawled content (HTML, metadata) durably and cheaply, making it available for the downstream indexing pipeline.

Why it matters

Crawling is expensive — you don't want to re-crawl just to reprocess with an improved indexing algorithm. Storing raw content lets you replay and reindex from the same crawl data indefinitely.

Trade-off

Object storage is eventually consistent on some providers. S3 charges per request ($0.004/10K). At 100B pages × 50KB/page, storage cost is ~$2.3M/month for raw HTML.

Real world

CommonCrawl stores 2.7 billion web pages of raw HTML in S3 (publicly available, ~300TB per crawl). Google Wayback Machine uses similar cold storage. Ahrefs stores crawl data in distributed object stores.

Capacity math

S3 has essentially unlimited capacity. At $0.023/GB/month, storing 1PB of crawl data costs $23K/month. Lifecycle policies compress and tier old data to Glacier ($0.004/GB/month).

In the real world: CommonCrawl stores 2.7 billion web pages of raw HTML in S3 (publicly available, ~300TB per crawl). Google Wayback Machine uses similar cold storage. Ahrefs stores crawl data in distributed object stores.

Add a Message Queue for URL Distribution

At high crawl volume, use a message queue to distribute URLs to worker pools instead of direct assignment.

What it does

A message queue (Kafka or SQS) acts as the URL frontier, buffering discovered URLs and distributing them to crawler workers.

Why it matters

At high volume, URL discovery is faster than crawling. A queue decouples discovery from fetching — the queue absorbs the burst.

Trade-off

Duplicate URLs will appear in the queue. Use a distributed bloom filter to deduplicate before enqueuing.

Real world

Googlebot uses a massive distributed URL frontier. Scrapy with Redis (scrapy-redis) implements this for Python crawlers.

Capacity math

Kafka handles 1M+ events/second. A 3-broker cluster can buffer billions of pending URLs.

In the real world: Googlebot uses a massive distributed URL frontier. Scrapy with Redis (scrapy-redis) implements this for Python crawlers.

Add a Load Balancer for Worker Pool

At peak crawl demand, scale worker nodes horizontally behind a load balancer to maximize concurrent fetch throughput.

What it does

A load balancer (or Kafka consumer group) distributes URL processing across a horizontally scaled worker pool.

Why it matters

At peak, crawl throughput is limited by worker count. Adding nodes scales fetch throughput linearly up to the queue partition count.

Trade-off

Politeness constraints (don't crawl one domain too fast) require per-domain rate limiting across all workers, not just per-worker.

Real world

Common Crawl runs thousands of crawler nodes on AWS Spot Instances. Each node is stateless and pulls from a shared queue.

Capacity math

Each worker fetches 50-200 pages/second. 100 workers = 5K-20K pages/second, sufficient for broad crawls.

In the real world: Common Crawl runs thousands of crawler nodes on AWS Spot Instances. Each node is stateless and pulls from a shared queue.

Crawler Worker Crash: crawler-1 crashes with 10K URLs in-flight. Without checkpointing, those URLs are lost from the frontier. How do you implement exactly-once URL processing using the message queue and idempotency keys?

§2Step 3 — Deep Dive

3Deep Dive

Crawler nodes are worker processes that fetch web pages via HTTP, parse HTML to extract links, and discover new URLs to crawl.

Storage	Throughput	Persistence	Priority support	Best for	Cost	Ops burden
Redis sorted set	100K ops/sec	Optional	Yes (score = priority)	Hot frontier, recent URLs ✓	Medium	Medium
Kafka topic	1M+ msg/sec	Yes (log)	Partial (priority topics)	Large-scale, distributed crawl	Medium	Medium
Postgres queue	10K/sec	Yes	Yes (ORDER BY priority)	Small crawl, exact dedup needed	Medium	Low
RabbitMQ	50K msg/sec	Yes	Yes (priority queues)	Complex routing, multiple crawlers	Medium	Medium
In-memory (single node)	>1M ops/sec	No	Yes	Prototype, single-crawler	Low	Low

URL frontier storage — Redis for hot queue, DB for durable frontier.

pythonPoliteness-aware crawler — domain-level rate limiting

import time
import redis
import requests
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

r = redis.Redis(decode_responses=True)
CRAWL_DELAY = 1.0  # seconds between requests to same domain

def can_crawl(url: str) -> bool:
    """Check robots.txt compliance."""
    domain = urlparse(url).netloc
    rp_key = f"robots:{domain}"
    robots_txt = r.get(rp_key)
    if not robots_txt:
        try:
            resp = requests.get(f"https://{domain}/robots.txt", timeout=5)
            robots_txt = resp.text
            r.setex(rp_key, 86400, robots_txt)  # cache 24h
        except Exception:
            return True  # assume crawlable if robots.txt unreachable
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("*", url)

def crawl(url: str) -> str | None:
    domain = urlparse(url).netloc
    # Politeness: one request per second per domain (distributed across fleet)
    last_key = f"crawl:last:{domain}"
    last_crawl = float(r.get(last_key) or 0)
    wait = CRAWL_DELAY - (time.time() - last_crawl)
    if wait > 0:
        time.sleep(wait)
    if not can_crawl(url):
        return None
    resp = requests.get(url, timeout=10, headers={'User-Agent': 'MyCrawler/1.0'})
    r.set(last_key, time.time())
    return resp.text

Component	Why Add It	Tradeoff
Crawler Nodes	Web crawling is embarrassingly parallel — fetching URL A doesn't block fetching URL B.	Crawlers must respect robots.
Message Queue for URL Frontier	The frontier decouples URL discovery from URL fetching.	The frontier can grow to trillions of URLs (the web has ~50B pages).
Redis for URL Deduplication	Without deduplication, the same URL gets crawled thousands of times (many sites link to each other).	Bloom Filters have a small false positive rate — some new URLs are incorrectly flagged as seen and skipped.
Object Storage for Crawled Content	Crawling is expensive — you don't want to re-crawl just to reprocess with an improved indexing algorithm.	Object storage is eventually consistent on some providers.
Message Queue for URL Distribution	At high volume, URL discovery is faster than crawling.	Duplicate URLs will appear in the queue.
Load Balancer for Worker Pool	At peak, crawl throughput is limited by worker count.	Politeness constraints (don't crawl one domain too fast) require per-domain rate limiting across all workers, not just per-worker.

Design decision tradeoffs

Crawler Worker Crash

crawler-1 crashes with 10K URLs in-flight. Without checkpointing, those URLs are lost from the frontier. How do you implement exactly-once URL processing using the message queue and idempotency keys?

URL Frontier Explosion

Crawling a highly interconnected site causes exponential URL discovery — each page links to 100 new pages. The URL frontier grows to 1B entries, exhausting memory. How do you implement URL deduplication with Bloom filters and prioritization to bound frontier size?

Politeness Violation on Hot Domain

All 10 crawlers hammer the same domain simultaneously, violating robots.txt and getting IP-banned. How do you implement per-domain rate limiting, crawl delay queues, and distributed polite crawling?

The crawl frontier (URLs to crawl) is a queue. Add a Message Queue node. Seed URLs are the starting point — crawler workers dequeue a URL, fetch the page, extract new URLs, and enqueue them.

Without deduplication, the same URL will be crawled hundreds of times. Add a Redis cache to track visited URLs. Before enqueuing a URL, check Redis — if already visited, skip it.

Multiple crawler workers process URLs from the queue in parallel. Politeness: each worker maintains a per-domain delay (1 request/second max per domain). Store crawled content in Object Storage (S3-equivalent).

§3Step 4 — Wrap Up

4Wrap Up

Decision	Choice	Why
Crawler Nodes	Crawler nodes are worker processes that fetch web pages via HTTP, parse HTML to extract links, and discover new URLs to crawl.	Web crawling is embarrassingly parallel — fetching URL A doesn't block fetching URL B.
Message Queue for URL Frontier	The URL frontier is a message queue (Kafka) that stores discovered but not-yet-crawled URLs.	The frontier decouples URL discovery from URL fetching.
Redis for URL Deduplication	Redis stores the set of visited URLs (using a Bloom Filter for memory efficiency) so crawlers can quickly check if a URL has already been fetched.	Without deduplication, the same URL gets crawled thousands of times (many sites link to each other).
Object Storage for Crawled Content	Object storage (S3/GCS) stores the raw crawled content (HTML, metadata) durably and cheaply, making it available for the downstream indexing pipeline.	Crawling is expensive — you don't want to re-crawl just to reprocess with an improved indexing algorithm.
Message Queue for URL Distribution	A message queue (Kafka or SQS) acts as the URL frontier, buffering discovered URLs and distributing them to crawler workers.	At high volume, URL discovery is faster than crawling.
Load Balancer for Worker Pool	A load balancer (or Kafka consumer group) distributes URL processing across a horizontally scaled worker pool.	At peak, crawl throughput is limited by worker count.

Key design decisions

If the interviewer asks to scale 10×: 10x the load — architectural moves that work. Identify the single bottleneck (usually the database write path) and address it first before horizontal scaling.

10× Target10K RPSwhere your architecture must hold

What's next