9.11.k Design a Payment System (Stripe / PayPal)
Problem Statement
Design a payment processing system like Stripe or PayPal that handles online transactions between merchants and customers. The system must guarantee exactly-once payment processing, handle multiple payment methods, detect fraud, maintain PCI-DSS compliance, and support refunds, chargebacks, and reconciliation.
1. Requirements
Functional Requirements
- Process payments (credit card, debit card, bank transfer, digital wallets)
- Support authorize-then-capture flow and direct charge flow
- Support merchant onboarding with KYC verification
- Provide checkout APIs for merchant integration
- Handle refunds (full and partial)
- Handle chargebacks and disputes
- Maintain transaction ledger with double-entry bookkeeping
- Send payment notifications via webhooks to merchants
- Support multiple currencies with exchange rate conversion
- Generate settlement reports for merchants
- Provide a merchant dashboard for transaction management
Non-Functional Requirements
- Exactly-once payment processing (idempotency guarantee)
- 99.999% availability for the payment path
- Transaction latency < 2 seconds end-to-end
- PCI-DSS Level 1 compliance for card data handling
- Support 10,000 transactions per second at peak
- Strong consistency for all financial data
- Comprehensive audit trail for every state change
- Disaster recovery with RPO = 0 (no data loss)
2. Capacity Estimation
Traffic
Daily transactions: 50 million
Transactions per second: 50M / 86,400 ~= 580 TPS
Peak TPS: ~10,000 (Black Friday, flash sales)
Webhook deliveries: 50M * 3 events avg = 150M webhooks/day
Webhook rate: ~1,750/sec
API calls per transaction: ~5 (auth, capture, status check, webhook, etc.)
Total API calls/sec: 580 * 5 = 2,900/sec average, 50K peak
Storage
Transaction record: ~2 KB (amounts, status, metadata, audit fields)
Daily transaction data: 50M * 2 KB = 100 GB/day
Yearly transaction data: ~36 TB
7-year retention: ~252 TB
Ledger entries: 50M * 4 entries avg = 200M entries/day * 200 bytes = 40 GB/day
Yearly ledger: ~14.6 TB
Card vault (tokenized): 500 million cards * 500 bytes = 250 GB
Merchant data: 2 million merchants * 5 KB = 10 GB
Audit log: ~200 GB/day (every state change logged)
Bandwidth
Inbound (payment requests): 580/sec * 2 KB = 1.16 MB/s
Outbound (responses): 580/sec * 1 KB = 0.58 MB/s
Webhook outbound: 1,750/sec * 1 KB = 1.75 MB/s
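The arithmetic above can be double-checked with a short script. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes) and a 365-day year, and the variable names are illustrative. It shows that the section rounds 579 TPS up to 580 and 1,736 webhooks/sec up to 1,750.

```python
# Back-of-the-envelope capacity check using the figures from this section.
SECONDS_PER_DAY = 86_400
KB = 1_000            # decimal units (assumption)
GB = 1_000 ** 3

daily_txns = 50_000_000

avg_tps = daily_txns / SECONDS_PER_DAY                # about 579
webhooks_per_sec = daily_txns * 3 / SECONDS_PER_DAY   # about 1,736

txn_bytes_per_day = daily_txns * 2 * KB               # 2 KB per transaction
txn_bytes_per_year = txn_bytes_per_day * 365
ledger_bytes_per_day = daily_txns * 4 * 200           # 4 entries of 200 bytes
ledger_bytes_per_year = ledger_bytes_per_day * 365

inbound_mb_per_sec = avg_tps * 2 * KB / 1_000_000

print(f"average TPS:        {avg_tps:,.0f}")
print(f"webhooks/sec:       {webhooks_per_sec:,.0f}")
print(f"transactions/day:   {txn_bytes_per_day / GB:,.0f} GB")
print(f"transactions/year:  {txn_bytes_per_year / GB / 1000:,.1f} TB")
print(f"ledger/year:        {ledger_bytes_per_year / GB / 1000:,.1f} TB")
print(f"inbound bandwidth:  {inbound_mb_per_sec:,.2f} MB/s")
```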
3. High-Level Architecture
+----------+        +-------------------+
| Merchant |------->|    API Gateway    |
|  Server  |        |  + Rate Limiter   |
+----------+        | + Auth (API Keys) |
                    +--------+----------+
                             |
          +------------------+------------------+
          |                  |                  |
   +------v------+    +------v------+    +------v------+
   |   Payment   |    |  Merchant   |    |   Webhook   |
   |   Service   |    |   Service   |    |   Service   |
   +------+------+    +-------------+    +------+------+
          |                                     |
   +------v------+                      +-------v------+
   | Risk/Fraud  |                      | Webhook Queue|
   |   Engine    |                      | (SQS/Kafka)  |
   +------+------+                      +--------------+
          |
   +------v------+
   |   Payment   |           +------------------+
   |   Router    |---------->|    Card Vault    |
   +------+------+           |  (PCI Isolated)  |
          |                  +------------------+
          +-------------+-------------+
          |             |             |
    +-----v-----+ +-----v-----+ +-----v-----+
    |   Visa    | | Mastercard| |   Bank    |
    |    PSP    | |    PSP    | |    ACH    |
    +-----+-----+ +-----------+ +-----------+
          |
   +------v------+
   |   Ledger    |    +------------------+
   |   Service   |--->|    Ledger DB     |
   +------+------+    |  (Append-only)   |
          |           +------------------+
   +------v------+
   | Settlement  |    +------------------+
   |   Service   |--->|  Settlement DB   |
   +-------------+    +------------------+

+------------------+    +------------------+
|  Reconciliation  |    |    Reporting     |
| Service (batch)  |    |     Service      |
+------------------+    +------------------+
Payment Router Detail
The Payment Router selects the optimal PSP for each transaction:
+------------------+
|  Payment Router  |
+--------+---------+
         |
         |  Routing rules:
         |  1. Card BIN -> preferred PSP (Visa Direct for Visa cards)
         |  2. Geographic routing (EU cards -> Adyen, US cards -> Stripe)
         |  3. Cost optimization (cheapest PSP for transaction type)
         |  4. Failover (if primary PSP down, route to secondary)
         |  5. Load balancing (spread across PSPs when equivalent)
         |
    +----+----+-----------+
    |         |           |
+---v---+ +---v---+ +-----v-----+
| PSP-1 | | PSP-2 | |   PSP-3   |
| (Visa | |(Adyen)| | (Checkout |
|  Net) | |       | |   .com)   |
+-------+ +-------+ +-----------+
4. API Design
POST /api/v1/payments
Headers:
Authorization: Bearer <merchant_api_key>
Idempotency-Key: "unique-request-id-abc123"
Body: {
"amount": 4999, // in smallest currency unit (cents)
"currency": "USD",
"payment_method_id": "pm_card_visa_4242",
"customer_id": "cust_abc",
"description": "Order #12345",
"metadata": { "order_id": "12345" },
"capture": true // false for auth-only
}
Response 201: {
"payment_id": "pay_7xKq9mP3",
"status": "succeeded", // pending|succeeded|failed|cancelled
"amount": 4999,
"currency": "USD",
"payment_method": { "type": "card", "last4": "4242", "brand": "visa" },
"created_at": "2026-04-11T10:00:00Z",
"idempotency_key": "unique-request-id-abc123"
}
POST /api/v1/payments/{payment_id}/capture
Headers:
Authorization: Bearer <merchant_api_key>
Idempotency-Key: "capture-unique-id"
Body: { "amount": 4999 } // can capture less than authorized
Response 200: {
"payment_id": "pay_7xKq9mP3",
"status": "captured",
"captured_amount": 4999
}
POST /api/v1/payments/{payment_id}/refund
Headers:
Authorization: Bearer <merchant_api_key>
Idempotency-Key: "refund-unique-id-xyz"
Body: {
"amount": 2000, // partial refund; omit for full refund
"reason": "customer_request"
}
Response 201: {
"refund_id": "ref_abc123",
"payment_id": "pay_7xKq9mP3",
"amount": 2000,
"status": "succeeded",
"created_at": "2026-04-11T11:00:00Z"
}
POST /api/v1/payment_methods
Headers: Authorization: Bearer <merchant_api_key>
Body: {
"type": "card",
"card": {
"number": "4242424242424242", // raw card number: submitted by our checkout SDK/iframe directly to the Tokenizer, never via merchant servers
"exp_month": 12,
"exp_year": 2028,
"cvc": "123"
},
"customer_id": "cust_abc"
}
Response 201: {
"payment_method_id": "pm_card_visa_4242",
"type": "card",
"card": { "last4": "4242", "brand": "visa", "exp_month": 12, "exp_year": 2028 }
}
GET /api/v1/payments/{payment_id}
Response 200: { ... full payment object with event timeline ... }
GET /api/v1/payments?customer_id=cust_abc&limit=20&starting_after=pay_xyz
Response 200: { "data": [...], "has_more": true }
POST /api/v1/webhooks
Body: {
"url": "https://merchant.com/webhooks/payments",
"events": ["payment.succeeded", "payment.failed", "refund.created",
"chargeback.created", "chargeback.resolved"]
}
Response 201: {
"webhook_id": "wh_abc",
"secret": "whsec_xxx..." // for signature verification
}
5. Database Schema
Payments Table (PostgreSQL -- ACID required)
CREATE TABLE payments (
payment_id VARCHAR(20) PRIMARY KEY,
merchant_id VARCHAR(20) NOT NULL REFERENCES merchants(merchant_id),
customer_id VARCHAR(20),
amount BIGINT NOT NULL, -- in smallest currency unit
currency VARCHAR(3) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
payment_method_id VARCHAR(50),
payment_method_type VARCHAR(20),
description VARCHAR(500),
metadata JSONB,
idempotency_key VARCHAR(255), -- unique per merchant (see idx_payments_idempotency)
risk_score DECIMAL(5,4),
risk_action VARCHAR(20), -- 'approve','review','decline'
failure_code VARCHAR(50),
failure_message VARCHAR(500),
authorized_amount BIGINT DEFAULT 0,
captured_amount BIGINT DEFAULT 0,
refunded_amount BIGINT DEFAULT 0,
fee_amount BIGINT DEFAULT 0,
net_amount BIGINT DEFAULT 0,
psp_name VARCHAR(50),
psp_reference VARCHAR(255), -- external PSP transaction ID
settlement_batch VARCHAR(50),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
version INTEGER DEFAULT 0 -- optimistic locking
);
CREATE INDEX idx_payments_merchant ON payments(merchant_id, created_at DESC);
CREATE INDEX idx_payments_customer ON payments(customer_id, created_at DESC);
-- Idempotency keys are chosen by merchants, so uniqueness is scoped per merchant
CREATE UNIQUE INDEX idx_payments_idempotency ON payments(merchant_id, idempotency_key);
CREATE INDEX idx_payments_status ON payments(status) WHERE status IN ('pending', 'authorized');
CREATE INDEX idx_payments_settlement ON payments(settlement_batch) WHERE settlement_batch IS NOT NULL;
Ledger Entries Table (Append-Only)
CREATE TABLE ledger_entries (
entry_id BIGSERIAL PRIMARY KEY,
payment_id VARCHAR(20) NOT NULL,
entry_type VARCHAR(10) NOT NULL, -- 'debit' or 'credit'
account_id VARCHAR(50) NOT NULL, -- merchant, platform, customer
account_type VARCHAR(20) NOT NULL, -- 'merchant', 'platform_fee', 'psp_fee'
amount BIGINT NOT NULL,
currency VARCHAR(3) NOT NULL,
balance_after BIGINT NOT NULL,
description VARCHAR(500),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Append-only: no UPDATE or DELETE allowed (enforced by DB triggers)
CREATE INDEX idx_ledger_account ON ledger_entries(account_id, created_at DESC);
CREATE INDEX idx_ledger_payment ON ledger_entries(payment_id);
Refunds Table
CREATE TABLE refunds (
refund_id VARCHAR(20) PRIMARY KEY,
payment_id VARCHAR(20) NOT NULL REFERENCES payments(payment_id),
amount BIGINT NOT NULL,
currency VARCHAR(3) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
reason VARCHAR(100),
idempotency_key VARCHAR(255) UNIQUE,
psp_reference VARCHAR(255),
initiated_by VARCHAR(20), -- 'merchant', 'system', 'admin'
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_refunds_payment ON refunds(payment_id);
Chargebacks Table
CREATE TABLE chargebacks (
chargeback_id VARCHAR(20) PRIMARY KEY,
payment_id VARCHAR(20) NOT NULL REFERENCES payments(payment_id),
amount BIGINT NOT NULL,
currency VARCHAR(3) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'open',
-- open|evidence_submitted|won|lost|expired
reason_code VARCHAR(10), -- card network reason code
reason_description VARCHAR(500),
evidence_due_by TIMESTAMP,
evidence_submitted JSONB, -- documents, emails, logs
psp_reference VARCHAR(255),
opened_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
resolved_at TIMESTAMP,
outcome VARCHAR(10) -- 'won', 'lost'
);
CREATE INDEX idx_chargebacks_payment ON chargebacks(payment_id);
CREATE INDEX idx_chargebacks_status ON chargebacks(status) WHERE status = 'open';
CREATE INDEX idx_chargebacks_due ON chargebacks(evidence_due_by) WHERE status = 'open';
Idempotency Keys Table
CREATE TABLE idempotency_keys (
key VARCHAR(255) NOT NULL,
merchant_id VARCHAR(20) NOT NULL,
request_hash VARCHAR(64) NOT NULL, -- SHA-256 of request body
response_code INTEGER, -- NULL while the request is still in progress
response_body JSONB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMP DEFAULT (CURRENT_TIMESTAMP + INTERVAL '24 hours'),
PRIMARY KEY (merchant_id, key) -- keys are scoped per merchant
);
Merchants Table
CREATE TABLE merchants (
merchant_id VARCHAR(20) PRIMARY KEY,
business_name VARCHAR(200) NOT NULL,
email VARCHAR(255) NOT NULL,
country VARCHAR(2) NOT NULL,
default_currency VARCHAR(3) NOT NULL,
kyc_status VARCHAR(20) DEFAULT 'pending', -- pending|verified|rejected
risk_level VARCHAR(10) DEFAULT 'standard',
settlement_schedule VARCHAR(20) DEFAULT 'T+2',
fee_rate_percent DECIMAL(5,4) DEFAULT 0.0290, -- 2.9%
fee_fixed_cents INTEGER DEFAULT 30, -- $0.30
api_key_hash VARCHAR(64) NOT NULL,
webhook_url VARCHAR(2048),
webhook_secret VARCHAR(64),
payout_account JSONB, -- bank account details
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Payment Audit Log (Append-Only, Immutable)
CREATE TABLE payment_audit_log (
log_id BIGSERIAL PRIMARY KEY,
payment_id VARCHAR(20) NOT NULL,
event_type VARCHAR(50) NOT NULL,
old_status VARCHAR(20),
new_status VARCHAR(20),
event_data JSONB,
actor VARCHAR(100), -- 'system', 'merchant', 'admin'
ip_address VARCHAR(45),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Partitioned by month for efficient archival
-- NEVER UPDATE or DELETE rows
CREATE INDEX idx_audit_payment ON payment_audit_log(payment_id, created_at);
6. Deep Dive: Payment Processing Flow
Happy Path (Direct Charge)
Merchant API Gateway Payment Service Fraud Engine PSP (Visa)
| | | | |
|-- POST /payments -->| | | |
| (Idempotency-Key) | | | |
| |-- auth + rate -->| | |
| | limit check | | |
| | | | |
| | |-- check key --->| |
| | | idempotency | |
| | | table | |
| | | | |
| | |-- risk check -->| |
| | | |-- score=0.1 |
| | | | (low risk) |
| | | | |
| | |-- save payment | |
| | | status=pending| |
| | | | |
| | |-- charge card --|------------->|
| | | | |
| | |<-- approved ----|--------------|
| | | | |
| | |-- update payment| |
| | | status=succeeded | |
| | | | |
| | |-- write ledger | |
| | | entries | |
| | | | |
| | |-- save idempotency response |
| | | | |
| | |-- enqueue webhook |
| | | | |
|<-- 201 Created -----|------------------| | |
Auth-Then-Capture Flow
Day 1: Authorization (reserve funds)
POST /api/v1/payments (capture: false)
-> Card network places a hold on customer's card
-> Payment status: "authorized"
-> No money moves yet
Day 3: Capture (after merchant ships the item)
POST /api/v1/payments/{id}/capture
-> Card network settles the held amount
-> Payment status: "captured"
-> Ledger entries created, money moves
This two-step flow prevents charging customers before goods are shipped.
Hotels, car rentals, and marketplaces commonly use this pattern.
Authorization holds typically expire after about 7 days; the exact window is set by the card network and issuer and varies by merchant category.
State Machine
+----------+
| CREATED |
+----+-----+
|
+----v-----+
+------>| PENDING |<------+
| +----+-----+ |
| | |
(retry) +-----+-----+ (timeout)
| | | |
+----+---+ | +---+----+ |
| FAILED | | |DECLINED| |
+--------+ | +--------+ |
v
+------+------+
| AUTHORIZED |----> (auth-only flow)
+------+------+
|
+------v------+ +----------+
| CAPTURED / | | VOIDED |
| SUCCEEDED | | (auth |
+------+------+ | cancelled)
| +----------+
+------v-------+
| PARTIALLY |
| REFUNDED |
+------+-------+
|
+------v------+
| FULLY |
| REFUNDED |
+-------------+
Separate path (chargebacks):
SUCCEEDED --> DISPUTED (chargeback initiated by cardholder)
DISPUTED --> EVIDENCE_SUBMITTED (merchant provides evidence)
EVIDENCE_SUBMITTED --> DISPUTE_WON | DISPUTE_LOST
State Transition Rules (enforced in code):
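VALID_TRANSITIONS = {
    "created": ["pending"],
    "pending": ["authorized", "succeeded", "failed", "declined"],
    "failed": ["pending"],  # retry path shown in the diagram
    "authorized": ["captured", "voided"],
    "captured": ["succeeded"],
    # A full refund can follow directly; it need not pass through a partial one
    "succeeded": ["partially_refunded", "fully_refunded", "disputed"],
    # Several partial refunds may be issued before the payment is fully refunded
    "partially_refunded": ["partially_refunded", "fully_refunded", "disputed"],
    "disputed": ["evidence_submitted"],
    "evidence_submitted": ["dispute_won", "dispute_lost"],
}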
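A transition table like this is usually enforced by one small guard that every status update passes through. The sketch below is illustrative: `apply_transition` and `InvalidTransition` are hypothetical names, and it uses only a three-state excerpt of the table rather than the full map.

```python
class InvalidTransition(Exception):
    """Raised when a status change is not allowed by the transition table."""


# Excerpt of the transition table; states missing from the map are terminal.
EXCERPT = {
    "created": ["pending"],
    "pending": ["authorized", "succeeded", "failed", "declined"],
    "authorized": ["captured", "voided"],
}


def apply_transition(table, current, new):
    """Return the new status if current -> new is allowed, else raise."""
    allowed = table.get(current, [])
    if new not in allowed:
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new


status = "created"
for next_status in ["pending", "authorized", "captured"]:
    status = apply_transition(EXCERPT, status, next_status)
print(status)
```

Routing every update through one guard also gives a natural place to write the audit log row, since the old and new status are both in hand.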
7. Deep Dive: Idempotency (Exactly-Once Processing)
Why Idempotency is Critical
Problem scenario without idempotency:
1. Merchant sends POST /payments (charge $50)
2. Our system processes the charge, Visa approves
3. Network timeout before merchant receives response
4. Merchant retries POST /payments (charge $50 again!)
5. Customer is charged $100 instead of $50
With idempotency:
1. Merchant sends POST /payments with Idempotency-Key: "order-123"
2. System processes the charge, Visa approves
3. Network timeout before merchant receives response
4. Merchant retries with same Idempotency-Key: "order-123"
5. System finds existing result, returns saved response
6. Customer is charged only $50
Implementation
import hashlib

def process_payment(request):
    key = request.headers["Idempotency-Key"]
    merchant_id = request.auth.merchant_id
    # canonicalize() is assumed to return a stable byte string for the body
    request_hash = hashlib.sha256(canonicalize(request.body)).hexdigest()

    # Step 1: Check if we have seen this key before
    existing = db.query(
        "SELECT * FROM idempotency_keys WHERE key = %s AND merchant_id = %s",
        key, merchant_id
    )
    if existing:
        # Verify request body matches (prevent key reuse with different params)
        if existing.request_hash != request_hash:
            raise Error(422, "Idempotency key reused with different request body")
        # No saved response yet: the first request is still being processed
        if existing.response_code is None:
            raise Error(409, "Request with this idempotency key is in progress")
        # Return cached response
        return Response(existing.response_code, existing.response_body)

    # Step 2: Acquire lock on the key (prevent concurrent duplicates).
    # The INSERT is atomic, so only one of two racing requests gets a row back.
    lock_acquired = db.execute(
        "INSERT INTO idempotency_keys (key, merchant_id, request_hash) "
        "VALUES (%s, %s, %s) ON CONFLICT DO NOTHING RETURNING key",
        key, merchant_id, request_hash
    )
    if not lock_acquired:
        # Another request with same key is in progress
        raise Error(409, "Request with this idempotency key is in progress")

    # Step 3: Process payment normally
    try:
        result = do_payment_processing(request)
    except Exception:
        # Remove key so merchant can retry. This is only safe if the failure
        # happened before the PSP call (or the PSP call is itself idempotent);
        # after an ambiguous PSP timeout, keep the key and resolve the status.
        db.execute(
            "DELETE FROM idempotency_keys WHERE key = %s AND merchant_id = %s",
            key, merchant_id
        )
        raise

    # Step 4: Save response for future idempotent lookups
    db.execute(
        "UPDATE idempotency_keys SET response_code=%s, response_body=%s "
        "WHERE key = %s AND merchant_id = %s",
        result.status_code, result.body, key, merchant_id
    )
    return result
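The "insert as lock" step is the part that is easiest to get wrong, so here is a runnable sketch of it in isolation. It uses an in-memory SQLite database purely for illustration (the design itself uses PostgreSQL), `INSERT OR IGNORE` as SQLite's counterpart to `ON CONFLICT DO NOTHING`, and a hypothetical `claim_key` helper and `idem_key` column name.

```python
import sqlite3


def claim_key(conn, merchant_id, idem_key, request_hash):
    """Return True if this call claimed the key, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_keys (merchant_id, idem_key, request_hash) "
        "VALUES (?, ?, ?)",
        (merchant_id, idem_key, request_hash),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the insert was ignored
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys ("
    " merchant_id TEXT NOT NULL,"
    " idem_key TEXT NOT NULL,"
    " request_hash TEXT NOT NULL,"
    " response_code INTEGER,"
    " response_body TEXT,"
    " PRIMARY KEY (merchant_id, idem_key))"
)

first = claim_key(conn, "merch_xyz", "order-123", "hash-a")     # new key
retry = claim_key(conn, "merch_xyz", "order-123", "hash-a")     # duplicate
other = claim_key(conn, "merch_other", "order-123", "hash-b")   # other merchant
print(first, retry, other)
```

Because the primary key includes the merchant, two merchants can reuse the same key string without colliding, while a retry from the same merchant is rejected by the database rather than by application logic.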
End-to-End Idempotency
Idempotency must be enforced at every boundary:
Merchant --> Our API : Idempotency-Key header
Our API --> PSP (Visa) : PSP-specific idempotency key (payment_id + "_charge")
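Our API --> Ledger : Unique constraint on a per-posting key (e.g., payment_id + posting sequence number); (payment_id, entry_type) alone would reject the several debits and credits one payment legitimately produces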
Our API --> Webhook : Event ID deduplication at merchant
If any layer retries, the next layer rejects the duplicate.
8. Deep Dive: Double-Entry Bookkeeping
Principle
Every financial transaction creates at least TWO ledger entries:
1. A DEBIT from one account
2. A CREDIT to another account
Sum of all debits == Sum of all credits (always balanced)
This invariant is checked continuously by the reconciliation system.
Payment Example ($50.00 payment)
Payment: Customer pays $50.00 to Merchant
Platform fee: 2.9% + $0.30 = $1.75
PSP fee: $0.25
Ledger entries:
+-------+-------------------+--------+--------+---------+
| Entry | Account | Debit | Credit | Balance |
+-------+-------------------+--------+--------+---------+
| 1 | Customer (source) | $50.00 | | |
| 2 | Platform Holding | | $50.00 | +$50.00 |
| 3 | Platform Holding | $48.00 | | +$2.00 |
| 4 | Merchant Balance | | $48.00 | +$48.00 |
| 5 | Platform Holding | $1.75 | | +$0.25 |
| 6 | Platform Revenue | | $1.75 | +$1.75 |
| 7 | Platform Holding | $0.25 | | $0.00 |
| 8 | PSP Payable | | $0.25 | +$0.25 |
+-------+-------------------+--------+--------+---------+
Verification: Total debits = $50 + $48 + $1.75 + $0.25 = $100
Total credits = $50 + $48 + $1.75 + $0.25 = $100 (balanced)
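The same postings can be reproduced in a few lines to confirm the invariant. This is a minimal sketch using integer cents; the `post` helper, the account names, and the half-up rounding of the 2.9% fee are assumptions made for illustration. Balances follow the credits-minus-debits convention used in the table.

```python
from collections import defaultdict


def platform_fee_cents(amount_cents):
    """2.9% + $0.30, rounded half up to the nearest cent with integer math."""
    return (amount_cents * 29 + 500) // 1000 + 30


def post(ledger, debit_account, credit_account, cents):
    """Append one balanced pair: a debit and a matching credit."""
    ledger.append((debit_account, "debit", cents))
    ledger.append((credit_account, "credit", cents))


amount = 5000                                  # $50.00
fee = platform_fee_cents(amount)               # platform fee
psp_fee = 25                                   # $0.25
merchant_net = amount - fee - psp_fee

ledger = []
post(ledger, "customer", "platform_holding", amount)
post(ledger, "platform_holding", "merchant_balance", merchant_net)
post(ledger, "platform_holding", "platform_revenue", fee)
post(ledger, "platform_holding", "psp_payable", psp_fee)

total_debits = sum(c for _, side, c in ledger if side == "debit")
total_credits = sum(c for _, side, c in ledger if side == "credit")

balances = defaultdict(int)
for account, side, cents in ledger:
    balances[account] += cents if side == "credit" else -cents

print(total_debits, total_credits, dict(balances))
```

Posting entries only in pairs through one helper is what makes the debits-equal-credits invariant hold by construction rather than by later checking.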
Refund Example ($50 full refund)
Ledger entries (reverse the original):
+-------+-------------------+--------+--------+
| Entry | Account | Debit | Credit |
+-------+-------------------+--------+--------+
| 9 | Merchant Balance | $48.25 | |
| 10 | Platform Holding | | $48.25 |
| 11 | Platform Revenue | $1.75 | |
| 12 | Platform Holding | | $1.75 |
| 13 | Platform Holding | $50.00 | |
| 14 | Customer (refund) | | $50.00 |
+-------+-------------------+--------+--------+
Note: PSP fee ($0.25) is typically NOT refunded.
Merchant absorbs the PSP fee on refunds: the merchant is debited $48.25 (the $48.00 received plus the $0.25 PSP fee), so Platform Holding nets to zero ($48.25 + $1.75 - $50.00).
9. Deep Dive: Fraud Detection
Multi-Layer Fraud Detection
Layer 1: Rule-Based (synchronous, < 5ms)
- Velocity checks: > 5 transactions in 1 minute from same card
- Amount thresholds: single transaction > $10,000
- Geographic anomaly: transaction from country different than card issuer
- BIN checks: known high-risk card BINs
- Blocked lists: known fraudulent cards, IPs, emails
Layer 2: ML Model (synchronous, < 100ms)
- Features:
- Transaction amount relative to customer average
- Time since last transaction
- Device fingerprint match
- IP geolocation vs billing address
- Merchant category risk score
- Historical chargeback rate for this card
- Behavioral signals (typing speed, mouse patterns on checkout)
- Output: risk score 0.0 - 1.0
Layer 3: Manual Review Queue (asynchronous)
- Transactions with score 0.7-0.9 queued for human review
- Score > 0.9 auto-declined
- Score < 0.3 auto-approved
- Score 0.3-0.7 approved with enhanced monitoring
Fraud Detection Pipeline:
+--------+ +----------+ +----------+ +---------+
| Request| --> | Rule | --> | ML | --> | Decision|
| | | Engine | | Scoring | | Engine |
+--------+ +----------+ +----------+ +---------+
| |
(< 5ms) +-----------+-----------+
| | |
Approve Review Decline
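The Decision Engine's thresholds from Layer 3 can be sketched as a single function. The bands in the text do not say which side a boundary value falls on, so this sketch assumes the stricter reading (0.7 goes to review, 0.3 to enhanced monitoring); the function name and return labels are illustrative.

```python
def decide(score):
    """Map a risk score in [0.0, 1.0] to an action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("risk score must be between 0.0 and 1.0")
    if score > 0.9:
        return "decline"                  # auto-declined
    if score >= 0.7:
        return "review"                   # queued for human review
    if score >= 0.3:
        return "approve_with_monitoring"  # approved, enhanced monitoring
    return "approve"                      # auto-approved


for s in (0.1, 0.5, 0.8, 0.95):
    print(s, decide(s))
```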
Risk Scoring Implementation
def compute_risk_score(transaction, customer_profile):
    score = 0.0

    # Velocity check: more than 3 transactions on this card in the last 5 minutes
    # (assumes the transaction carries a hash of the card, never the raw number)
    recent_txns = get_transactions(transaction.card_hash, last_minutes=5)
    if len(recent_txns) > 3:
        score += 0.3

    # Amount anomaly: more than 5x the customer's average
    avg_amount = customer_profile.avg_transaction_amount
    if avg_amount > 0 and transaction.amount > avg_amount * 5:
        score += 0.2

    # Geolocation mismatch
    if transaction.ip_country != customer_profile.billing_country:
        score += 0.2

    # Device fingerprint
    if transaction.device_id not in customer_profile.known_devices:
        score += 0.15

    # ML model adjustment (trained on historical fraud data)
    ml_score = ml_model.predict(extract_features(transaction, customer_profile))

    # Blend rule-based and ML scores
    final_score = 0.4 * score + 0.6 * ml_score
    return min(final_score, 1.0)
10. Deep Dive: PCI-DSS Compliance
Architecture for PCI Isolation
+----------------------------------------------------------+
| Public Zone |
| +------------+ +--------------+ +------------------+ |
| | API Gateway| | Merchant | | Merchant Portal | |
| | | | Checkout SDK | | (dashboard) | |
| +------+-----+ +------+-------+ +------------------+ |
| | | |
+---------+---------------+--------------------------------+
| |
+=========+===============+============================+
| PCI-DSS Scope (isolated network segment) |
| |
| +------------+ +------------------+ |
| | Tokenizer | | Card Vault | |
| | Service |<--->| (encrypted at | |
| | | | rest, HSM keys) | |
| +------+-----+ +------------------+ |
| | |
| +------v-----------+ +---------------------+ |
| | Payment Processor| | Key Management | |
| | (PSP connector) | | (HSM - Hardware | |
| +------------------+ | Security Module) | |
| +---------------------+ |
+======================================================+
Tokenization Flow
1. Client-side: Card number entered in merchant's checkout form
(within an iframe served from OUR domain, not the merchant's)
2. JavaScript SDK sends card directly to our Tokenizer over TLS 1.3
(card number NEVER touches the merchant's servers)
3. Tokenizer encrypts card with HSM-managed key, stores in Card Vault
4. Returns token: "pm_card_visa_4242"
5. Merchant uses token for all subsequent API calls
6. Merchant never sees or stores full card number
Key outcome: the merchant's PCI-DSS scope shrinks to the lightest self-assessment
(typically SAQ A), while WE carry the full PCI-DSS Level 1 burden
Data Classification
+---------------------+------------------+------------------------+
| Data Type | Classification | Storage |
+---------------------+------------------+------------------------+
| Full card number | PCI Restricted | Card Vault (encrypted) |
| CVV/CVC | PCI Restricted | NEVER stored |
| Card expiry | PCI Sensitive | Card Vault (encrypted) |
| Cardholder name | PCI Sensitive | Card Vault (encrypted) |
| Payment token | Non-PCI | Payment DB |
| Transaction amount | Non-PCI | Payment DB |
| Last 4 digits | Non-PCI | Payment DB |
+---------------------+------------------+------------------------+
11. Deep Dive: Refunds and Chargebacks
Refund Flow
Merchant Payment Service PSP (Visa) Ledger
| | | |
|-- POST /refund --------->| | |
| (Idempotency-Key) | | |
| |-- validate: -------->| |
| | refund <= captured | |
| | refund <= remaining | |
| | | |
| |-- create refund | |
| | status=pending | |
| | | |
| |-- reverse charge --->| |
| | | |
| |<-- approved ---------| |
| | | |
| |-- update refund | |
| | status=succeeded | |
| | | |
| |-- reverse ledger ----|----------------->|
| | entries | |
| | | |
| |-- update payment | |
| | refunded_amount | |
| | | |
| |-- enqueue webhook | |
| | "refund.succeeded" | |
| | | |
|<-- 201 refund created --| | |
Timing: refund to customer's card takes 5-10 business days
(funds move: us -> card network -> issuing bank -> customer)
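The validation step in the flow above (refund <= captured, refund <= remaining) can be sketched as follows. Amounts are integer cents; `validate_refund` and `status_after_refund` are hypothetical helper names, and omitting the amount means a full refund of whatever remains, as in the API section.

```python
def validate_refund(captured_amount, refunded_amount, requested=None):
    """Return the refund amount in cents, or raise ValueError if not allowed."""
    remaining = captured_amount - refunded_amount
    amount = remaining if requested is None else requested
    if amount <= 0:
        raise ValueError("refund amount must be positive")
    if amount > remaining:
        raise ValueError(f"refund {amount} exceeds refundable remainder {remaining}")
    return amount


def status_after_refund(captured_amount, refunded_total):
    """Payment status once the refund has been applied."""
    if refunded_total == captured_amount:
        return "fully_refunded"
    return "partially_refunded"


captured, refunded = 4999, 0
first = validate_refund(captured, refunded, 2000)    # partial refund
refunded += first
second = validate_refund(captured, refunded)         # refund the rest
refunded += second
print(first, second, status_after_refund(captured, refunded))
```

In the real service this check and the `refunded_amount` update would need to happen under the payment row's optimistic lock (the `version` column), otherwise two concurrent partial refunds could both pass validation.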
Chargeback Flow
A chargeback occurs when a cardholder disputes a charge with their bank.
Timeline:
Day 0: Customer calls bank: "I did not make this purchase"
Day 1: Issuing bank files chargeback with card network
Day 2: Card network notifies us via PSP
Day 3: We notify merchant via webhook: "chargeback.created"
Day 3-21: Merchant submits evidence (receipt, shipping proof, logs)
Day 21: Evidence deadline
Day 30-45: Card network reviews evidence
Day 45-75: Decision: merchant wins or loses
Our system:
1. Receive chargeback notification from PSP
2. Create chargeback record in database
3. Immediately debit merchant balance (provisional hold)
4. Notify merchant via webhook + email
5. Provide evidence submission API
6. Track deadline and send reminders
7. On resolution:
- Merchant wins: credit merchant balance back
- Merchant loses: debit becomes permanent; fee charged ($15-25)
Chargeback rate monitoring:
If merchant chargeback rate > 1%: flag for review
If merchant chargeback rate > 2%: restrict merchant account
Card networks penalize platforms with high chargeback rates
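The rate thresholds can be expressed as a small check. This sketch assumes the rate is chargebacks divided by transactions over some monitoring window (the text does not define the window), and it compares integers to avoid floating-point edge cases at exactly 1% and 2%. The function name and labels are illustrative.

```python
def chargeback_action(chargebacks, transactions):
    """Classify a merchant by chargeback rate over a monitoring window."""
    if transactions <= 0:
        return "ok"  # no volume, nothing to judge
    if chargebacks * 100 > 2 * transactions:   # rate > 2%
        return "restrict"
    if chargebacks * 100 > 1 * transactions:   # rate > 1%
        return "review"
    return "ok"


print(chargeback_action(5, 1000))    # 0.5%
print(chargeback_action(15, 1000))   # 1.5%
print(chargeback_action(30, 1000))   # 3.0%
```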
12. Deep Dive: Reconciliation
Why Reconciliation
Three sources of truth that can diverge:
1. Our internal ledger (what we think happened)
2. PSP records (what Visa/Mastercard thinks happened)
3. Bank settlement files (what the bank actually moved)
Reconciliation ensures all three agree.
Reconciliation Pipeline
Daily batch process (runs at 2:00 AM UTC):
Step 1: Fetch PSP settlement files
- Download settlement files (e.g., CSV over SFTP) from Visa, Mastercard, etc.
- Parse into standardized format
Step 2: Match internal records
For each PSP record:
- Find matching payment by psp_reference
- Compare: amount, currency, status, timestamp
Step 3: Identify discrepancies
- MATCHED: Internal record matches PSP record
- UNMATCHED_INTERNAL: We have a record, PSP does not
- UNMATCHED_EXTERNAL: PSP has a record, we do not
- AMOUNT_MISMATCH: Amounts differ
- STATUS_MISMATCH: Status differs
Step 4: Generate reconciliation report
+-------------------------------------------+
| Daily Reconciliation Report |
| Date: 2026-04-11 |
+-------------------------------------------+
| Total transactions: 50,234,891 |
| Matched: 50,234,650 (99.99%)|
| Unmatched (internal): 142 |
| Unmatched (external): 87 |
| Amount mismatches: 12 |
+-------------------------------------------+
Step 5: Auto-resolve known patterns
- Timing differences (processed at midnight boundary)
- Currency rounding (< $0.01 difference)
- Delayed PSP processing (appears next day)
Step 6: Escalate unresolved to finance team
- Auto-create JIRA ticket for each unresolved discrepancy
- SLA: resolve within 48 hours
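Steps 2 and 3 of the pipeline amount to a keyed comparison of two record sets. The sketch below assumes both sides have already been parsed into dicts keyed by `psp_reference` with `(amount_cents, status)` values; that shape and the `reconcile` name are illustrative, and timestamp comparison is left out for brevity.

```python
def reconcile(internal, external):
    """Classify every psp_reference seen on either side."""
    results = {}
    for ref, (amount, status) in external.items():
        ours = internal.get(ref)
        if ours is None:
            results[ref] = "UNMATCHED_EXTERNAL"   # PSP has it, we do not
        elif ours[0] != amount:
            results[ref] = "AMOUNT_MISMATCH"
        elif ours[1] != status:
            results[ref] = "STATUS_MISMATCH"
        else:
            results[ref] = "MATCHED"
    for ref in internal.keys() - external.keys():
        results[ref] = "UNMATCHED_INTERNAL"       # we have it, PSP does not
    return results


internal = {
    "psp_1": (4999, "succeeded"),
    "psp_2": (2000, "succeeded"),
    "psp_3": (1500, "succeeded"),
    "psp_5": (700, "succeeded"),
}
external = {
    "psp_1": (4999, "succeeded"),
    "psp_2": (2100, "succeeded"),
    "psp_3": (1500, "refunded"),
    "psp_4": (900, "succeeded"),
}
report = reconcile(internal, external)
for ref in sorted(report):
    print(ref, report[ref])
```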
Balance Reconciliation
Continuous verification (every hour):
For each merchant account:
Calculated balance = SUM(credits) - SUM(debits) in ledger
Stored balance = cached running balance for the account (e.g., a current_balance column, not shown in the merchants schema above)
If calculated != stored:
ALERT: balance drift detected
Action: freeze payouts, investigate
This catches bugs where a ledger write succeeded but the
balance update did not (or vice versa).
13. Webhook Delivery System
Reliable Webhook Delivery
Payment Event --> Outbox Table --> Outbox Worker --> Kafka --> Webhook Worker
|
Merchant Endpoint
Using the Outbox Pattern:
BEGIN TRANSACTION
UPDATE payments SET status = 'succeeded' ...
INSERT INTO outbox (event_type, payload) VALUES ('payment.succeeded', ...)
COMMIT
Outbox worker reads outbox table, publishes to Kafka, marks as sent.
This ensures the payment update and event are atomic.
Webhook Worker logic:
1. Consume event from Kafka
2. Construct webhook payload
3. Sign payload with merchant's webhook secret
signature = HMAC-SHA256(webhook_secret, timestamp + "." + payload)
4. POST to merchant's webhook URL
Headers:
X-Webhook-Signature: sha256=abc123...
X-Webhook-Timestamp: 1681200000
X-Webhook-Id: evt_abc123
5. If merchant responds 2xx: mark delivered, commit offset
6. If merchant responds 4xx/5xx or timeout: schedule retry
Retry schedule (exponential backoff):
Attempt 1: immediate
Attempt 2: 1 minute
Attempt 3: 5 minutes
Attempt 4: 30 minutes
Attempt 5: 2 hours
Attempt 6: 12 hours
Attempt 7: 24 hours (final attempt)
After 7 failures: mark as "failed", notify merchant via email.
Merchant can replay missed webhooks from their dashboard.
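The signing step, and the matching check a merchant would run on receipt, can be sketched with the standard library. The function names are illustrative, and the 5-minute timestamp tolerance is an assumption added here to limit replay of captured requests; the text itself only specifies the HMAC formula.

```python
import hashlib
import hmac
import time


def sign_webhook(secret: bytes, timestamp: int, payload: bytes) -> str:
    """HMAC-SHA256 over '<timestamp>.<payload>', as in the worker logic above."""
    message = str(timestamp).encode() + b"." + payload
    return "sha256=" + hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret, timestamp, payload, signature, now=None, tolerance=300):
    """Merchant-side check: fresh timestamp and matching signature."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance:
        return False  # too old (or too far in the future): possible replay
    expected = sign_webhook(secret, timestamp, payload)
    # compare_digest runs in constant time to avoid leaking the signature
    return hmac.compare_digest(expected.encode(), signature.encode())


secret = b"whsec_test"
body = b'{"id": "evt_abc123", "type": "payment.succeeded"}'
ts = 1681200000
sig = sign_webhook(secret, ts, body)

print(verify_webhook(secret, ts, body, sig, now=ts + 10))          # genuine
print(verify_webhook(secret, ts, body + b" ", sig, now=ts + 10))   # tampered
print(verify_webhook(secret, ts, body, sig, now=ts + 3600))        # stale
```

Since delivery is at-least-once, merchants should also deduplicate on the event ID (`X-Webhook-Id`) after the signature check passes.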
Webhook Payload Example
{
"id": "evt_abc123",
"type": "payment.succeeded",
"created": 1681200000,
"data": {
"payment_id": "pay_7xKq9mP3",
"amount": 4999,
"currency": "USD",
"status": "succeeded",
"merchant_id": "merch_xyz"
}
}
14. Payment Gateway Integration
Multi-PSP Strategy
Why multiple PSPs?
1. Redundancy: if one PSP goes down, route to another
2. Cost optimization: different PSPs charge different rates
3. Authorization rates: some PSPs have higher approval rates for certain
card types or regions
4. Coverage: some PSPs support local payment methods others do not
Routing logic:
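def select_psp(payment):
    # Check PSP health
    healthy_psps = [p for p in ALL_PSPS if health_check(p)]
    # Filter by payment method support
    compatible = [p for p in healthy_psps if p.supports(payment.method)]
    if not compatible:
        # max() on an empty list would raise ValueError; fail explicitly instead
        raise RuntimeError("No healthy PSP supports this payment method")

    # Score by: authorization rate, cost, latency.
    # The three terms have different units (rate, currency, seconds), so in
    # practice each should be normalized to a common 0-1 range before weighting.
    def score(psp):
        return (
            0.4 * psp.auth_rate_for(payment.card_bin)
            + 0.3 * (1 / psp.cost_for(payment))
            + 0.3 * (1 / psp.avg_latency)
        )

    return max(compatible, key=score)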
Failover:
If primary PSP returns error or times out (> 5 seconds):
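1. Retry once with primary PSP, reusing the same PSP idempotency key (network glitch)
2. If still fails, route to secondary PSP; after a timeout, first confirm or void the primary attempt so the card is not charged twice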
3. Record the failover for monitoring
4. If primary PSP fails > 10% of requests in 5 min: circuit breaker OPEN
(all traffic goes to secondary until primary recovers)
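Rule 4 can be sketched as a sliding-window breaker. This is a simplified illustration: the class and its `min_requests` guard (which stops a single early failure from tripping the breaker) are assumptions, and there is no explicit half-open probing. The breaker closes again once the failures age out of the 5-minute window.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=0.10, window_seconds=300, min_requests=20):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.min_requests = min_requests
        self.events = deque()  # (timestamp, succeeded) in arrival order

    def _trim(self, now):
        # Drop events that have fallen out of the sliding window
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def record(self, succeeded, now):
        self.events.append((now, succeeded))
        self._trim(now)

    def is_open(self, now):
        """True when the PSP should be bypassed in favor of the secondary."""
        self._trim(now)
        total = len(self.events)
        if total < self.min_requests:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total > self.failure_threshold


breaker = CircuitBreaker(min_requests=10)
for t in range(18):
    breaker.record(True, now=t)
breaker.record(False, now=18)
breaker.record(False, now=19)
at_threshold = breaker.is_open(now=20)     # 2 of 20 failed: not above 10%
breaker.record(False, now=21)
above_threshold = breaker.is_open(now=22)  # 3 of 21 failed: above 10%
after_window = breaker.is_open(now=1000)   # all events have aged out
print(at_threshold, above_threshold, after_window)
```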
15. Scaling Considerations
Database Scaling
Strategy: Shard by merchant_id
Shard routing:
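shard_id = hash(merchant_id) % num_shards
(plain modulo remaps most merchants when num_shards changes; a consistent-hash ring or a merchant-to-shard lookup table avoids that)
Shard layout (16 shards):
Each shard: PostgreSQL primary + 2 synchronous replicas (zero data loss)
Read scaling: read replicas for dashboard/reporting queries
Write scaling: sharding by merchant_id spreads writes across shards, evenly only while no single merchant dominates traffic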
Hot Merchant Problem
Problem: A single large merchant (e.g., Amazon) generates 10% of all traffic.
One shard becomes a bottleneck.
Solution: Sub-shard large merchants by customer_id within the merchant shard.
shard_id = hash(merchant_id + customer_id) % num_sub_shards
Alternatively: Dedicated shard cluster for top 10 merchants.
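A minimal sketch of the two routing functions follows. It assumes 16 shards and 8 sub-shards, and it uses SHA-256 rather than Python's built-in `hash()`, because string hashing in Python is randomized per process and every API server must compute the same shard. The function names are illustrative.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Process-independent hash: first 8 bytes of SHA-256 as an integer."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(merchant_id: str, num_shards: int = 16) -> int:
    """Primary routing: one shard per merchant."""
    return stable_hash(merchant_id) % num_shards


def sub_shard_for(merchant_id: str, customer_id: str, num_sub_shards: int = 8) -> int:
    """Hot-merchant routing: spread one merchant's customers over sub-shards."""
    return stable_hash(f"{merchant_id}:{customer_id}") % num_sub_shards


print(shard_for("merch_xyz"))
print(sub_shard_for("merch_xyz", "cust_abc"))
```

Sub-sharding by customer keeps a single customer's payments together, but merchant-wide queries (dashboards, settlement) then have to fan out across all sub-shards.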
Payment Processing Pipeline
For 10,000 TPS peak:
API Gateway: 20 instances (500 TPS each)
Payment Service: 40 instances (250 TPS each)
Fraud Engine: 20 instances (500 TPS each)
PSP Connectors: 10 per PSP (load balanced)
Database: 16 shards * 3 replicas = 48 DB instances
Webhook Workers: 30 instances
Kafka: 12 brokers, 64 partitions for payment events
Total infrastructure: ~200 instances for payment path
Multi-Currency Support
Payment in EUR, merchant settles in USD:
1. Customer pays 40.00 EUR
2. System converts: 40.00 EUR * 1.08 (rate) = 43.20 USD
3. Apply fee: 43.20 * 2.9% + 0.30 = $1.55 fee
4. Merchant receives: 43.20 - 1.55 = $41.65 USD
Exchange Rate Service:
- Rates fetched from multiple providers every minute
- Cached with 5-minute TTL
- Locked at time of payment (quoted rate honored for 30 minutes)
- Spread applied (0.5-1% markup on mid-market rate)
Schema addition:
payment_amount: 4000 (in EUR cents)
payment_currency: "EUR"
settlement_amount: 4320 (in USD cents)
settlement_currency: "USD"
exchange_rate: 1.08
rate_locked_at: "2026-04-11T10:00:00Z"
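The conversion and fee arithmetic above can be reproduced with `Decimal` to avoid binary floating-point rounding. This is a sketch under stated assumptions: amounts are integer cents, rounding is half-up to the nearest cent, and the fee is taken in the settlement currency as in the worked example. The `settle` and `to_cents` names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_cents(value: Decimal) -> int:
    """Round a fractional cent amount half up to a whole number of cents."""
    return int(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def settle(payment_cents, rate, fee_rate=Decimal("0.029"), fee_fixed_cents=30):
    """Return (settlement, fee, merchant_net), all in settlement-currency cents."""
    settlement = to_cents(Decimal(payment_cents) * rate)
    fee = to_cents(Decimal(settlement) * fee_rate) + fee_fixed_cents
    return settlement, fee, settlement - fee


settlement, fee, net = settle(4000, Decimal("1.08"))   # 40.00 EUR at 1.08
print(settlement, fee, net)
```

Storing the rate as a decimal string (or a scaled integer) alongside `rate_locked_at` lets the same calculation be replayed exactly during reconciliation.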
Geographic Distribution
Active-Active across 3 regions:
+------------------+ +------------------+ +------------------+
| US-East | | EU-West | | AP-Southeast |
| - Full stack | | - Full stack | | - Full stack |
| - DB primary | | - DB primary | | - DB primary |
| (US merchants)| | (EU merchants)| | (APAC merch.) |
+------------------+ +------------------+ +------------------+
Merchant assigned to region by country.
Within-region: synchronous replication (RPO = 0).
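Cross-region: async replication for disaster recovery (RPO of roughly 500 ms of replication lag). The RPO = 0 requirement therefore holds for failures within a region; losing an entire region can drop the last few hundred milliseconds of writes unless cross-region replication is also made synchronous.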
16. Key Tradeoffs
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Database | NoSQL (scale) | PostgreSQL (ACID) | PostgreSQL |
| Idempotency storage | Redis (fast, volatile) | DB (durable) | DB |
| Fraud check | Sync only (slow, safe) | Async only (fast, risky) | Both layers |
| Ledger model | Single-entry (simple) | Double-entry (auditable) | Double-entry |
| Webhook delivery | At-most-once | At-least-once | At-least-once |
| Card storage | Merchant-side | Centralized vault | Central vault |
| Settlement timing | Real-time | Batch (T+2) | Batch (T+2) |
| Consistency model | Eventual | Strong | Strong |
| Event publishing | Dual-write (simple) | Outbox pattern (safe) | Outbox |
| PSP strategy | Single PSP | Multi-PSP routing | Multi-PSP |
17. Failure Scenarios and Mitigations
| Scenario | Mitigation |
|---|---|
| PSP timeout during charge | Store as "pending"; query PSP for status; idempotent retry |
| Double charge (network retry) | Idempotency key prevents duplicate processing |
| DB failure mid-transaction | Saga pattern: compensating transaction to PSP |
| Webhook endpoint down | Exponential retry for 24 hours; manual replay |
| Fraud model false positive | Manual review queue; merchant override option |
| Reconciliation mismatch | Auto-resolve known patterns; escalate unknowns |
| HSM failure | Redundant HSM pair; failover in < 1s |
| Surge in chargebacks | Auto-pause merchant; escalate to risk team |
| Data center failure | Active-active in 3 regions; DNS failover |
| Exchange rate stale | Lock rate at quote time; refresh every minute |
| Outbox worker lag | Monitor outbox table size; auto-scale workers |
Key Takeaways
- Idempotency is non-negotiable in payment systems -- every write endpoint must accept an idempotency key and guarantee exactly-once semantics, including the calls to external PSPs.
- Double-entry bookkeeping ensures every dollar is accounted for and enables automated reconciliation with external PSP records.
- PCI-DSS compliance is achieved by isolating card data in a tokenization vault -- merchants never see raw card numbers, drastically reducing audit scope.
- The outbox pattern solves the dual-write problem -- database state changes and event publishing become atomic, which is critical for webhook reliability.
- Strong consistency is chosen over availability for financial data -- unlike most other systems, payment systems cannot tolerate eventual consistency for the transaction path.
- Multi-PSP routing with automatic failover ensures that a single PSP outage does not take down the entire payment platform.