Open on desktop
Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.
Bloom Filter
§1Step 2 — High-Level Design
Implement a space-efficient probabilistic data structure. Used in databases, CDNs, and spam filters.
Place a Bloom Filter between the crawler and the URL queue to avoid re-crawling URLs that have already been visited.
A Bloom Filter is a probabilistic data structure that answers 'have I seen this item before?' with zero false negatives and a small, tunable false positive rate.
A web crawler visits billions of URLs. Storing every URL in a database to check for duplicates would require terabytes of RAM or slow disk lookups. A Bloom Filter stores the same 1 billion URLs in ~1.2GB of memory with 1% false positive rate.
False positives mean some unseen URLs get skipped (harmless for crawlers). False negatives are impossible — if the filter says 'not seen', it definitely hasn't been.
Google's web crawler, Apache Cassandra (for SSTables), and Chrome (for Safe Browsing) all use Bloom Filters at massive scale.
1 billion URLs with 1% false positive rate requires ~1.2GB of RAM. At 10 bits per element, it supports ~10B URL fingerprints per GB.
At high traffic, back the bloom filter with a cache layer to absorb repeated lookups without hitting the database.
A Redis cache stores positive membership results so confirmed members skip the database entirely.
At high traffic, even bloom-filter misses create database load. A cache on confirmed members reduces DB reads by 90%+.
Cache invalidation is needed when members are removed. TTL-based expiry is simplest.
Medium and Pinterest use Redis caches in front of their persistence layers for hot content.
Redis handles 1M+ ops/second. A 16GB instance can cache ~100M string entries.
At peak load, distribute requests across multiple filter nodes to scale horizontally.
A load balancer routes membership check requests across multiple filter+cache nodes.
A single node caps out. At peak, multiple stateless nodes behind a load balancer give linear throughput scaling.
All nodes must have identical filter data. Use a shared Redis for the filter bits, or sync on startup.
Google's Bigtable uses distributed bloom filters coordinated via a central metadata service.
Each filter node can handle ~100K checks/second. Three nodes behind a LB = 300K checks/second.
§2Step 3 — Deep Dive
A Bloom Filter is a probabilistic data structure that answers 'have I seen this item before?' with zero false negatives and a small, tunable false positive rate.
| Approach | Memory for 1B items | False negatives? | Lookup cost | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Hash Set (in-memory) | ~8 GB | No | O(1) | Small datasets, exact dedup | Low | Low |
| Bloom Filter | ~1.2 GB (1% FPR) | No | O(k) hash fns | Large-scale crawler dedup ✓ | Low | Low |
| Redis SET | ~50 bytes/URL | No | O(1) network | Distributed, but 40× more RAM | Medium | Medium |
| Postgres lookup | N/A (disk) | No | O(log n) + network | Persistent, audit trail | Medium | Low |
| Cuckoo Filter | ~1 GB | No | O(1) | Need deletions + low FPR | Low | Low |
Deduplication approaches — Bloom filter wins on memory at scale.
import math
import mmh3
from bitarray import bitarray
class BloomFilter:
def __init__(self, capacity: int, fp_rate: float = 0.01):
# Optimal bit count: m = -n * ln(p) / (ln 2)^2
m = int(-capacity * math.log(fp_rate) / (math.log(2) ** 2))
# Optimal hash count: k = (m/n) * ln(2)
self.k = int((m / capacity) * math.log(2))
self.bits = bitarray(m)
self.bits.setall(0)
self.m = m
def add(self, item: str):
for i in range(self.k):
idx = mmh3.hash(item, i) % self.m
self.bits[idx] = 1
def might_contain(self, item: str) -> bool:
return all(
self.bits[mmh3.hash(item, i) % self.m]
for i in range(self.k)
)
# 1B URLs, 1% FPR: m ≈ 9.6B bits = 1.2 GB, k = 7 hash functions
bf = BloomFilter(capacity=1_000_000_000, fp_rate=0.01)
bf.add("https://example.com/page")
print(bf.might_contain("https://example.com/page")) # True
print(bf.might_contain("https://not-seen.com")) # False (probably)| Component | Why Add It | Tradeoff |
|---|---|---|
| Bloom Filter | A web crawler visits billions of URLs. | False positives mean some unseen URLs get skipped (harmless for crawlers). |
| Cache Layer | At high traffic, even bloom-filter misses create database load. | Cache invalidation is needed when members are removed. |
| Load Balancer | A single node caps out. | All nodes must have identical filter data. |
Design decision tradeoffs
A write spike inserts 10x normal elements into the Bloom filter, pushing the false-positive rate from 1% to 40%. Lookups start returning false positives and users see incorrect cache misses. How do you detect saturation and rotate to a fresh filter?
The backing database goes down. The Bloom filter may say an item exists (possibly a false positive), but the DB lookup fails. How do you handle the fallback: serve stale data, queue the request, or surface an error to the user?
Under 100K inserts/second the multiple hash function computations (5+ hashes per element) become CPU-bound, doubling insert latency to 2ms. How do you optimize: use faster hash families, reduce hash count, or pre-batch inserts?
§3Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Bloom Filter | A Bloom Filter is a probabilistic data structure that answers 'have I seen this item before? | A web crawler visits billions of URLs. |
| Cache Layer | A Redis cache stores positive membership results so confirmed members skip the database entirely. | At high traffic, even bloom-filter misses create database load. |
| Load Balancer | A load balancer routes membership check requests across multiple filter+cache nodes. | A single node caps out. |
Key design decisions