Payment Gateway

Intermediate · Transactions · Reliability

Commerce·55 min read

Step 2: High-Level Design

Build a payment processor with idempotency, exactly-once semantics, and fraud detection.

System architecture overview

Add Postgres for Payment Records

Connect Postgres to durably store payment transactions with ACID guarantees.

What it does

Postgres stores the authoritative payment records: transaction ID, amount, currency, status (pending/completed/failed), payer, recipient, and timestamp.
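As a rough illustration of the record described above, the fields could be modeled as follows. This is a minimal sketch, not the lesson's schema: the class names, the `amount_cents` integer representation, and the rule that a settled payment is never reopened are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum


class PaymentStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed rule for this sketch: a payment settles once and is never reopened.
ALLOWED = {
    PaymentStatus.PENDING: {PaymentStatus.COMPLETED, PaymentStatus.FAILED},
    PaymentStatus.COMPLETED: set(),
    PaymentStatus.FAILED: set(),
}


@dataclass(frozen=True)
class PaymentRecord:
    transaction_id: str
    amount_cents: int        # integer minor units, never floats
    currency: str            # ISO 4217 code such as "USD"
    status: PaymentStatus
    payer: str
    recipient: str
    created_at: datetime

    def with_status(self, new_status: PaymentStatus) -> "PaymentRecord":
        """Return a copy with the new status, rejecting illegal transitions."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} not allowed")
        return replace(self, status=new_status)


pending = PaymentRecord("txn-1", 1250, "USD", PaymentStatus.PENDING,
                        "alice", "bob", datetime.now(timezone.utc))
completed = pending.with_status(PaymentStatus.COMPLETED)
```

Keeping the record immutable and funneling status changes through one method makes it harder to move a completed payment back to pending by accident.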

Why it matters

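Payments require ACID guarantees. If a charge succeeds at the payment processor but the API crashes before writing to the database, the payment is lost. A Postgres transaction cannot roll back an external charge, so it cannot make the charge and the write atomic by itself; what it does guarantee is that the local writes (payment record and ledger entries) commit together or not at all. Writing a pending record before calling the processor, then marking it completed or failed, leaves a trail that retries and reconciliation can resolve.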

Trade-off

Postgres is the write bottleneck for high-volume payments. At around 10K transactions/second, a single primary is close to its write limit. Beyond that, shard by merchant ID or use Citus for horizontal scaling.

Real world

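PayPal has historically run on Oracle, and Robinhood uses Postgres for financial records; both provide ACID transactions. Not every processor makes the same choice: Stripe has described its primary datastore as a MongoDB-based document database, and Square has reportedly relied heavily on MySQL.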

Capacity math

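Payments table: 10K TPS × 86,400 seconds × 500 bytes = 432 GB/day if the peak rate were sustained all day. Partition by date and archive old partitions to a warehouse such as Redshift. A well-indexed Postgres primary on a 16-core instance can approach 10K TPS, but with little headroom, which is why sharding is the next step.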
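The storage estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures from the text (10K TPS, 500 bytes per row) and assumes decimal gigabytes and a peak rate held for the whole day, so it is an upper bound.

```python
TPS = 10_000
SECONDS_PER_DAY = 86_400
ROW_BYTES = 500

rows_per_day = TPS * SECONDS_PER_DAY
bytes_per_day = rows_per_day * ROW_BYTES
gb_per_day = bytes_per_day / 1e9          # decimal gigabytes

print(f"{rows_per_day:,} rows/day")       # 864,000,000 rows/day
print(f"{gb_per_day:.0f} GB/day")         # 432 GB/day
```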

Add Redis for Idempotency Keys

Add Redis to store idempotency keys that prevent duplicate payment processing on network retries.

What it does

Redis stores idempotency keys — unique client-generated IDs that ensure a payment request is processed exactly once even if the client retries due to a network timeout.

Why it matters

In payments, network timeouts are dangerous: was the charge processed? The client doesn't know. Without idempotency, retrying causes a double charge. Redis stores the result of the first processing attempt — retries return the cached result instantly.

Trade-off

Idempotency keys expire (typically 24 hours). Retries after expiry are treated as new requests. For long-running payment disputes, a separate deduplication table in Postgres is needed.
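One way to picture the durable deduplication table is a table keyed on the idempotency key, where the uniqueness constraint rejects a second insert. The sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere; the table name `payment_dedup`, its columns, and the `record_once` helper are hypothetical.

```python
import json
import sqlite3

# sqlite3 stands in for Postgres here; table and column names are hypothetical.
dedup_db = sqlite3.connect(":memory:")
dedup_db.execute("""
    CREATE TABLE payment_dedup (
        idempotency_key TEXT PRIMARY KEY,   -- uniqueness allows one row per key
        result_json     TEXT NOT NULL
    )
""")


def record_once(key: str, result: dict) -> dict:
    """Store the result for key, or return the result already stored for it."""
    try:
        with dedup_db:  # one transaction: commit on success, roll back on error
            dedup_db.execute(
                "INSERT INTO payment_dedup (idempotency_key, result_json) VALUES (?, ?)",
                (key, json.dumps(result)),
            )
        return result
    except sqlite3.IntegrityError:
        # The key already exists, so this is a retry: return the original result.
        row = dedup_db.execute(
            "SELECT result_json FROM payment_dedup WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        return json.loads(row[0])


first = record_once("key-1", {"status": "success", "txn_id": "a"})
retry = record_once("key-1", {"status": "success", "txn_id": "b"})
```

Unlike the Redis entry, this row has no TTL, so a retry that arrives days later still resolves to the original result. In Postgres the same effect is usually achieved with `INSERT ... ON CONFLICT (idempotency_key) DO NOTHING`.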

Real world

Stripe's API accepts an Idempotency-Key header on POST requests so that clients can retry safely. Adyen uses idempotency keys for payment retries. Most major payment APIs implement this pattern.

Capacity math

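1M payments/day × 200 bytes per idempotency record, each held for the 24-hour TTL, = 200 MB peak Redis memory, which fits easily on a single Redis instance. At a sustained 10K TPS (864M payments/day) the same math gives about 173 GB, which would need a sharded Redis deployment.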

Add a Message Queue for Payment Events

Add Kafka to propagate payment events to downstream systems: ledger, notifications, fraud detection.

What it does

Kafka carries payment lifecycle events (PaymentInitiated, PaymentCompleted, PaymentFailed, RefundRequested) to all downstream consumers.

Why it matters

After processing a payment, the system must update the ledger, send a receipt email, trigger fraud analysis, and notify the merchant, all of which are slow operations. Publishing to Kafka lets the payment API return as soon as the payment is recorded while these steps happen asynchronously.

Trade-off

Downstream systems are eventually consistent with the payment API. The ledger might lag by seconds. For financial reporting, ensure the ledger consumer has exactly-once semantics (Kafka transactions or idempotent writes).
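The "idempotent writes" option can be sketched without Kafka at all: the consumer remembers which event IDs it has applied and skips redeliveries. In this minimal example the `PaymentEvent` fields, the `event_id` assumed to be set by the producer, and the in-memory `seen` set are all illustrative assumptions; a real consumer would keep the seen-set in the same database transaction as the ledger update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentEvent:
    event_id: str        # assumed unique per event, assigned by the producer
    type: str            # e.g. "PaymentCompleted"
    account_id: str
    amount_cents: int


class LedgerConsumer:
    """Applies each event at most once, even if the broker redelivers it."""

    def __init__(self) -> None:
        self.seen: set[str] = set()           # stand-in for a unique-keyed table
        self.balances: dict[str, int] = {}

    def handle(self, event: PaymentEvent) -> bool:
        if event.event_id in self.seen:
            return False                      # duplicate delivery: skip it
        if event.type == "PaymentCompleted":
            self.balances[event.account_id] = (
                self.balances.get(event.account_id, 0) + event.amount_cents
            )
        self.seen.add(event.event_id)
        return True


consumer = LedgerConsumer()
event = PaymentEvent("evt-1", "PaymentCompleted", "merchant-1", 500)
applied_first = consumer.handle(event)
applied_again = consumer.handle(event)        # simulated redelivery
```

With at-least-once delivery plus this check, the ledger sees each payment's effect exactly once, which is the property financial reporting needs.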

Real world

Stripe uses internal Kafka for payment event propagation. Square's event bus carries payment events. Adyen uses an event-driven architecture for payment lifecycle management.

Capacity math

10K payments/second × 2 KB per event = 20 MB/second of Kafka throughput. With 3x replication, 60 MB/second. This is well within the capacity of a single 3-broker Kafka cluster.

Step 3: Deep Dive

Strategy	Prevents double-charge?	Complexity	Auditability	Best for	Cost	Ops burden
Idempotency key (Redis SETNX)	Yes (client-provided key)	Low	No	API-level dedup for retries ✓	Medium	Low
Double-entry ledger	Yes (balance constraints)	Medium	Full audit trail	Financial systems, compliance	Low	Low
Distributed 2PC	Yes (across DBs)	High	Yes	Multi-bank transactions	Medium	High
Saga pattern	Yes (compensating txns)	High	Yes (event log)	Microservices, long-running flows	Medium	High
Optimistic locking (version)	Yes (retry on conflict)	Low	Partial	Low-contention payment flows	Low	Low

Payment consistency strategies — idempotency keys + double-entry are the foundation.

Idempotent payment in Python: Redis dedup + double-entry Postgres ledger

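import json
import os
import uuid

import psycopg2
import redis

r = redis.Redis()
DATABASE_URL = os.environ["DATABASE_URL"]  # assumed: connection string from the environment

def process_payment(idempotency_key: str, from_account: str,
                    to_account: str, amount_cents: int) -> dict:
    lock_key = f"payment:idem:{idempotency_key}"
    result_key = f"payment:result:{idempotency_key}"
    # SET NX claims the key atomically; a retry lands in this branch.
    if not r.set(lock_key, "processing", nx=True, ex=86400):
        cached = r.get(result_key)
        return json.loads(cached) if cached else {"status": "processing"}

    txn_id = str(uuid.uuid4())
    conn = psycopg2.connect(DATABASE_URL)
    try:
        with conn:  # commits on success, rolls back on exception
            with conn.cursor() as cur:
                # Lock the payer row so concurrent payments serialize.
                cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
                            (from_account,))
                row = cur.fetchone()
                if row is None or row[0] < amount_cents:
                    raise ValueError("Insufficient funds")

                # Double entry: the debit and credit rows commit together.
                cur.execute(
                    "INSERT INTO ledger (txn_id, account_id, amount, type) VALUES (%s,%s,%s,'debit')",
                    (txn_id, from_account, -amount_cents))
                cur.execute(
                    "INSERT INTO ledger (txn_id, account_id, amount, type) VALUES (%s,%s,%s,'credit')",
                    (txn_id, to_account, amount_cents))
                # Keep the balances used by the funds check in step with the ledger.
                cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                            (amount_cents, from_account))
                cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                            (amount_cents, to_account))
    except Exception:
        # Nothing was committed, so release the key and let the client retry.
        r.delete(lock_key)
        raise
    finally:
        conn.close()

    # Committed: cache the result outside the try block so that a Redis error
    # here cannot release the key and allow a second charge.
    result = {"status": "success", "txn_id": txn_id}
    r.set(result_key, json.dumps(result), ex=86400)
    return result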

Component	Why Add It	Tradeoff
Postgres for Payment Records	Payments require ACID guarantees.	Postgres is the write bottleneck for high-volume payments.
Redis for Idempotency Keys	In payments, network timeouts are dangerous: was the charge processed?	Idempotency keys expire (typically 24 hours).
Message Queue for Payment Events	After processing a payment, the system must update the ledger, send a receipt email, trigger fraud analysis, and notify the merchant — all slow operations.	Downstream systems are eventually consistent with the payment API.

Design decision tradeoffs

Payment API Crash Mid-Transaction

api-1 crashes after charging the customer but before writing to the database. The client retries and is charged twice. How do you use idempotency keys to prevent this? api-1 claims the client's key in Redis before charging and stores the key → result mapping before returning, so a retry returns the cached result instead of charging again.

Payment Processor Timeout

The network between api-1 and the external payment processor goes down after the charge succeeds remotely but before the response arrives. The client gets a timeout and retries. How do idempotency keys prevent the double charge, and how do reconciliation jobs detect the discrepancies that remain?
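A reconciliation job for this scenario boils down to a set comparison between what the processor says it charged and what the local database recorded. The sketch below assumes both sides can be exported as a mapping of charge ID to amount in cents; the `reconcile` function, the category names, and the sample IDs are hypothetical.

```python
def reconcile(processor_charges: dict[str, int],
              local_payments: dict[str, int]) -> dict[str, list[str]]:
    """Compare charge_id -> amount_cents from the processor with local records."""
    processor_ids = set(processor_charges)
    local_ids = set(local_payments)
    return {
        # Charged remotely but never recorded locally (the timeout case).
        "missing_locally": sorted(processor_ids - local_ids),
        # Recorded locally but unknown to the processor.
        "missing_at_processor": sorted(local_ids - processor_ids),
        # Present on both sides with different amounts.
        "amount_mismatch": sorted(
            cid for cid in processor_ids & local_ids
            if processor_charges[cid] != local_payments[cid]
        ),
    }


report = reconcile(
    processor_charges={"ch_1": 1000, "ch_2": 2500, "ch_3": 400},
    local_payments={"ch_1": 1000, "ch_3": 450, "ch_4": 900},
)
print(report)
```

Here `ch_2` is the charge that succeeded remotely while the response was lost; the job surfaces it so the local record can be repaired or the charge refunded.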

Ledger Database Saturation

A flash sale triggers 10K concurrent payment attempts. The Postgres ledger database becomes a write bottleneck — each payment requires an atomic debit+credit. How do you use the message queue to buffer writes, apply optimistic locking, and partition the ledger by user ID?
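The optimistic-locking part of this scenario can be sketched with a version column: a writer updates the row only if the version it read is still current, and retries otherwise. The example uses `sqlite3` as a stand-in for Postgres, and the `accounts` schema, the `debit` helper, and the retry limit are assumptions made for illustration.

```python
import sqlite3

# sqlite3 stands in for Postgres; the accounts schema is hypothetical.
accounts_db = sqlite3.connect(":memory:")
accounts_db.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER, version INTEGER)"
)
accounts_db.execute("INSERT INTO accounts VALUES ('alice', 1000, 0)")
accounts_db.commit()


def debit(account_id: str, amount_cents: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        row = accounts_db.execute(
            "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if row is None or row[0] < amount_cents:
            return False
        balance, version = row
        with accounts_db:
            # The write succeeds only if nobody changed the row since we read it.
            cur = accounts_db.execute(
                "UPDATE accounts SET balance = ?, version = version + 1 "
                "WHERE id = ? AND version = ?",
                (balance - amount_cents, account_id, version),
            )
        if cur.rowcount == 1:
            return True
        # rowcount == 0 means a concurrent writer won; re-read and retry.
    return False


ok = debit("alice", 300)
too_much = debit("alice", 5000)
```

No row lock is held between the read and the write, which helps when contention is low; under a flash sale on a single hot account, the retries themselves become the cost.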

The biggest risk in payments: the network times out after the charge is processed but before your server gets the response. Your server retries — double charge! Solution: idempotency keys. Redis stores {idempotencyKey → result} for 24 hours.

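Before charging, SET the Redis key idempotencyKey → "processing" with NX (only if it does not exist). If the SET succeeds, proceed with the charge, store the result, and return it. If the key already exists, return the stored result; if no result is stored yet, the first attempt is still in flight, so report "processing" rather than charging again.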

Double-entry bookkeeping: every payment = debit payer account + credit payee account in a single Postgres transaction. This ensures the money never disappears and the ledger always balances. Never update balances in two separate transactions.
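The "ledger always balances" invariant can be turned into a query: the entries of every transaction must sum to zero. The sketch below again uses `sqlite3` in place of Postgres, with a simplified `ledger` table; the `transfer` and `unbalanced_transactions` helpers are hypothetical names for this example.

```python
import sqlite3
import uuid

# sqlite3 stands in for Postgres; the ledger schema is simplified for the sketch.
ledger_db = sqlite3.connect(":memory:")
ledger_db.execute("""
    CREATE TABLE ledger (
        txn_id     TEXT NOT NULL,
        account_id TEXT NOT NULL,
        amount     INTEGER NOT NULL     -- negative = debit, positive = credit
    )
""")


def transfer(payer: str, payee: str, amount_cents: int) -> str:
    txn_id = str(uuid.uuid4())
    with ledger_db:  # both rows commit together or not at all
        ledger_db.execute("INSERT INTO ledger VALUES (?, ?, ?)",
                          (txn_id, payer, -amount_cents))
        ledger_db.execute("INSERT INTO ledger VALUES (?, ?, ?)",
                          (txn_id, payee, amount_cents))
    return txn_id


def unbalanced_transactions() -> list[str]:
    """Return the IDs of transactions whose entries do not sum to zero."""
    rows = ledger_db.execute(
        "SELECT txn_id FROM ledger GROUP BY txn_id HAVING SUM(amount) != 0"
    ).fetchall()
    return [row[0] for row in rows]


txn = transfer("alice", "bob", 1200)
```

Running `unbalanced_transactions()` as a periodic audit should always return an empty list; any row it reports points to a debit and credit that were written in separate transactions.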

Step 4: Wrap Up

Decision	Choice	Why
Postgres for Payment Records	Postgres stores the authoritative payment records: transaction ID, amount, currency, status (pending/completed/failed), payer, recipient, and timestamp.	Payments require ACID guarantees.
Redis for Idempotency Keys	Redis stores idempotency keys — unique client-generated IDs that ensure a payment request is processed exactly once even if the client retries due to a network timeout.	In payments, network timeouts are dangerous: was the charge processed?
Message Queue for Payment Events	Kafka carries payment lifecycle events (PaymentInitiated, PaymentCompleted, PaymentFailed, RefundRequested) to all downstream consumers.	After processing a payment, the system must update the ledger, send a receipt email, trigger fraud analysis, and notify the merchant — all slow operations.

Key design decisions

If the interviewer asks to scale 10×: identify the single bottleneck (usually the database write path) and address it first, before adding horizontal capacity elsewhere.

10× target: 5K RPS, where your architecture must hold.
