Open on desktop

Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.

Best on desktop

Back to lesson

Bloom Filter

beginnerAlgorithmsData Structures

Fundamentals·20 min read

Bloom Filter

1Understand the Problem & Establish Design Scope→2High-Level Design→3Deep Dive→4Wrap Up

AlgorithmsData Structures

§1Step 2 — High-Level Design

2High-Level Design

Implement a space-efficient probabilistic data structure. Used in databases, CDNs, and spam filters.

System architecture overview

Stage 1 of 4Starting state — the problem to solve

Progressive build — add each component step by step

Add a Bloom Filter

Place a Bloom Filter between the crawler and the URL queue to avoid re-crawling URLs that have already been visited.

What it does

A Bloom Filter is a probabilistic data structure that answers 'have I seen this item before?' with zero false negatives and a small, tunable false positive rate.

Why it matters

A web crawler visits billions of URLs. Storing every URL in a database to check for duplicates would require terabytes of RAM or slow disk lookups. A Bloom Filter stores the same 1 billion URLs in ~1.2GB of memory with 1% false positive rate.

Trade-off

False positives mean some unseen URLs get skipped (harmless for crawlers). False negatives are impossible — if the filter says 'not seen', it definitely hasn't been.

Real world

Google's web crawler, Apache Cassandra (for SSTables), and Chrome (for Safe Browsing) all use Bloom Filters at massive scale.

Capacity math

1 billion URLs with 1% false positive rate requires ~1.2GB of RAM. At 10 bits per element, it supports ~10B URL fingerprints per GB.

In the real world: Google's web crawler, Apache Cassandra (for SSTables), and Chrome (for Safe Browsing) all use Bloom Filters at massive scale.

Add a Cache Layer

At high traffic, back the bloom filter with a cache layer to absorb repeated lookups without hitting the database.

What it does

A Redis cache stores positive membership results so confirmed members skip the database entirely.

Why it matters

At high traffic, even bloom-filter misses create database load. A cache on confirmed members reduces DB reads by 90%+.

Trade-off

Cache invalidation is needed when members are removed. TTL-based expiry is simplest.

Real world

Medium and Pinterest use Redis caches in front of their persistence layers for hot content.

Capacity math

Redis handles 1M+ ops/second. A 16GB instance can cache ~100M string entries.

In the real world: Medium and Pinterest use Redis caches in front of their persistence layers for hot content.

Add a Load Balancer

At peak load, distribute requests across multiple filter nodes to scale horizontally.

What it does

A load balancer routes membership check requests across multiple filter+cache nodes.

Why it matters

A single node caps out. At peak, multiple stateless nodes behind a load balancer give linear throughput scaling.

Trade-off

All nodes must have identical filter data. Use a shared Redis for the filter bits, or sync on startup.

Real world

Google's Bigtable uses distributed bloom filters coordinated via a central metadata service.

Capacity math

Each filter node can handle ~100K checks/second. Three nodes behind a LB = 300K checks/second.

In the real world: Google's Bigtable uses distributed bloom filters coordinated via a central metadata service.

Filter Saturation Under Write Spike: A write spike inserts 10x normal elements into the Bloom filter, pushing the false-positive rate from 1% to 40%. Lookups start returning false positives and users see incorrect cache misses. How do you detect saturation and rotate to a fresh filter?

§2Step 3 — Deep Dive

3Deep Dive

A Bloom Filter is a probabilistic data structure that answers 'have I seen this item before?' with zero false negatives and a small, tunable false positive rate.

Approach	Memory for 1B items	False negatives?	Lookup cost	Best for	Cost	Ops burden
Hash Set (in-memory)	~8 GB	No	O(1)	Small datasets, exact dedup	Low	Low
Bloom Filter	~1.2 GB (1% FPR)	No	O(k) hash fns	Large-scale crawler dedup ✓	Low	Low
Redis SET	~50 bytes/URL	No	O(1) network	Distributed, but 40× more RAM	Medium	Medium
Postgres lookup	N/A (disk)	No	O(log n) + network	Persistent, audit trail	Medium	Low
Cuckoo Filter	~1 GB	No	O(1)	Need deletions + low FPR	Low	Low

Deduplication approaches — Bloom filter wins on memory at scale.

pythonBloom Filter — bitarray + k hash functions

import math
import mmh3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, capacity: int, fp_rate: float = 0.01):
        # Optimal bit count: m = -n * ln(p) / (ln 2)^2
        m = int(-capacity * math.log(fp_rate) / (math.log(2) ** 2))
        # Optimal hash count: k = (m/n) * ln(2)
        self.k = int((m / capacity) * math.log(2))
        self.bits = bitarray(m)
        self.bits.setall(0)
        self.m = m

    def add(self, item: str):
        for i in range(self.k):
            idx = mmh3.hash(item, i) % self.m
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[mmh3.hash(item, i) % self.m]
            for i in range(self.k)
        )

# 1B URLs, 1% FPR: m ≈ 9.6B bits = 1.2 GB, k = 7 hash functions
bf = BloomFilter(capacity=1_000_000_000, fp_rate=0.01)
bf.add("https://example.com/page")
print(bf.might_contain("https://example.com/page"))  # True
print(bf.might_contain("https://not-seen.com"))       # False (probably)

Component	Why Add It	Tradeoff
Bloom Filter	A web crawler visits billions of URLs.	False positives mean some unseen URLs get skipped (harmless for crawlers).
Cache Layer	At high traffic, even bloom-filter misses create database load.	Cache invalidation is needed when members are removed.
Load Balancer	A single node caps out.	All nodes must have identical filter data.

Design decision tradeoffs

Filter Saturation Under Write Spike

A write spike inserts 10x normal elements into the Bloom filter, pushing the false-positive rate from 1% to 40%. Lookups start returning false positives and users see incorrect cache misses. How do you detect saturation and rotate to a fresh filter?

Database Unavailable

The backing database goes down. The Bloom filter may say an item exists (possibly a false positive), but the DB lookup fails. How do you handle the fallback: serve stale data, queue the request, or surface an error to the user?

Hash Function Bottleneck

Under 100K inserts/second the multiple hash function computations (5+ hashes per element) become CPU-bound, doubling insert latency to 2ms. How do you optimize: use faster hash families, reduce hash count, or pre-batch inserts?

Checking the visited-URLs database on every crawl is expensive at scale. A Bloom filter can answer "definitely not visited" or "probably visited" in O(1) with no DB hit for the "definitely not" case.

Add a Bloom Filter node. Connect it in front of the Visited URLs DB. If the Bloom filter says "probably visited", still check the DB (to avoid false positive crawl-skipping). If "definitely not visited", crawl it.

Memory sizing: for 1 billion URLs with 1% false positive rate, a Bloom filter requires only ~1.2GB. A hash set would require 8GB+. This is the tradeoff: probabilistic + compact vs. exact + expensive.

§3Step 4 — Wrap Up

4Wrap Up

Decision	Choice	Why
Bloom Filter	A Bloom Filter is a probabilistic data structure that answers 'have I seen this item before?	A web crawler visits billions of URLs.
Cache Layer	A Redis cache stores positive membership results so confirmed members skip the database entirely.	At high traffic, even bloom-filter misses create database load.
Load Balancer	A load balancer routes membership check requests across multiple filter+cache nodes.	A single node caps out.

Key design decisions

If the interviewer asks to scale 10×: From prototype to planet-scale. Introduce consistent hashing to redistribute load as you add nodes — minimize cache/shard remapping.

10× Target500K RPSwhere your architecture must hold

What's next