Distributed Transactions: An Intuitive Deep Dive
“Ensuring all-or-nothing across machines without melting clocks or tearing your hair out.”
1. The Problem
You have two services, OrderSvc and InventorySvc, each with its own database. A single “buy 100 shirts” operation spans:
- OrderSvc: insert orders(order_id, user_id, qty)
- InventorySvc: update stock(item_id) -= qty
We require atomicity across services. Formally, let the two services’ local transactions be T₁ and T₂. We need a global transaction G = {T₁, T₂} satisfying:
- Atomicity: commit(G) ⇔ commit(T₁) ∧ commit(T₂)
- Consistency: database invariants preserved
- Isolation: no interleaving from other global transactions
- Durability: once acked, changes survive crashes
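To see the failure mode, consider the naive implementation: two independent local commits with nothing tying them together. The sketch below (hypothetical DB-API-style handles and schema) shows exactly where atomicity breaks:

```python
# Naive approach: two independent local transactions, no global protocol.
# Handles, table names, and schema are hypothetical, DB-API style.

def buy_shirts(order_db, inventory_db, order_id, user_id, item_id, qty):
    # T1: commits locally in OrderSvc's database...
    order_db.execute(
        "INSERT INTO orders (order_id, user_id, qty) VALUES (?, ?, ?)",
        (order_id, user_id, qty))
    order_db.commit()

    # ...but if InventorySvc crashes right here, T2 never runs: the order
    # exists, stock is never decremented, and G is half-committed.
    inventory_db.execute(
        "UPDATE stock SET qty = qty - ? WHERE item_id = ?",
        (qty, item_id))
    inventory_db.commit()
```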
2. Two-Phase Commit (2PC) in Pseudocode
Coordinator pseudocode:
```
// Phase 1: Prepare
for each participant P:
    send P.prepare(GID)
    await P.vote ∈ {YES, NO}      // timeout ⇒ NO

// Phase 2: Commit/Abort
if all votes == YES: decision = COMMIT
else:                decision = ABORT
for each participant P:
    send P.decision(decision)
```
Participant pseudocode:
```
on prepare(GID):
    write UNDO/REDO to local log
    if local checks pass → reply YES
    else → reply NO

on decision(COMMIT):
    apply REDO, release locks

on decision(ABORT):
    apply UNDO, release locks
```
Latency ≈ 2 × RTT plus the synchronous disk fsyncs for the log records. The coordinator must hear all N votes, and the protocol blocks: a participant that has voted YES holds its locks until the coordinator tells it the outcome, so a dead coordinator stalls everyone.
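To make the message flow concrete, here is a minimal in-memory sketch in Python (no real network, no durable storage; `Participant`, `Vote`, and `local_checks_pass` are illustrative names, and the log list stands in for a durable log):

```python
# Minimal in-memory sketch of the 2PC flow above.
from enum import Enum

class Vote(Enum):
    YES = 1
    NO = 2

class Participant:
    def __init__(self, name):
        self.name = name
        self.log = []   # stands in for the durable UNDO/REDO log

    def prepare(self, gid):
        self.log.append(("UNDO/REDO", gid))   # must be fsync'd before voting YES
        return Vote.YES if self.local_checks_pass(gid) else Vote.NO

    def local_checks_pass(self, gid):
        return True     # e.g. constraint checks, lock acquisition

    def decide(self, gid, commit):
        # Apply REDO on commit / UNDO on abort, then release locks.
        self.log.append(("COMMIT" if commit else "ABORT", gid))

def two_phase_commit(gid, participants):
    # Phase 1: collect votes (a real coordinator treats a timeout as NO).
    votes = [p.prepare(gid) for p in participants]
    commit = all(v == Vote.YES for v in votes)
    # Phase 2: broadcast the decision.
    for p in participants:
        p.decide(gid, commit)
    return commit
```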
3. Math of Quorum & Fault Tolerance
For N replicas coordinating via Paxos/Raft:
- Write quorum W ≥ ⌊N/2⌋ + 1
- Read quorum R ≥ ⌊N/2⌋ + 1
- Guarantee: W + R > N ⇒ every read quorum intersects every write quorum, so any read sees the latest committed write
In 2PC, you implicitly require all participants to respond (a strict quorum of N). If even one participant is down, the entire global transaction G stalls.
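A quick brute-force check of the W + R > N intersection argument, with illustrative N = 5:

```python
# If W + R > N, every read quorum overlaps every write quorum, so a read
# always contacts at least one replica holding the latest write.
from itertools import combinations

N = 5
W = N // 2 + 1   # ⌊N/2⌋ + 1 = 3
R = N // 2 + 1   # 3

replicas = range(N)
for write_q in combinations(replicas, W):
    for read_q in combinations(replicas, R):
        assert set(write_q) & set(read_q)   # never empty when W + R > N
print(f"N={N}: W={W}, R={R}; writes tolerate {N - W} replica failures")
```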
4. Three-Phase Commit (3PC) & Timeouts
3PC splits prepare into:
- CanCommit
- PreCommit (write REDO/UNDO and fence)
- DoCommit
Per-phase timeouts let participants avoid indefinite blocking; yet under a network partition combined with a coordinator crash, participants can still reach divergent decisions. No free lunch: stronger liveness at the cost of extra protocol complexity and one more round trip.
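The participant-side timeout rules are where 3PC actually differs from 2PC; a sketch, with illustrative state names:

```python
# Sketch of a 3PC participant's timeout rules. The key property: a timeout
# before PreCommit is always safe to abort; a timeout after PreCommit lets
# the participant commit, since no one can have decided to abort.
from enum import Enum, auto

class State(Enum):
    INIT = auto()          # before CanCommit
    READY = auto()         # answered yes to CanCommit, awaiting PreCommit
    PRECOMMITTED = auto()  # received PreCommit (REDO/UNDO written, fenced)
    COMMITTED = auto()
    ABORTED = auto()

def on_timeout(state):
    if state in (State.INIT, State.READY):
        # Coordinator cannot have sent DoCommit to anyone yet: abort is safe.
        return State.ABORTED
    if state == State.PRECOMMITTED:
        # Everyone reachable has pre-committed; proceed to commit.
        # Under a partition this is exactly where replicas can diverge.
        return State.COMMITTED
    return state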
5. The Saga Pattern
Instead of global locks, define a state machine:
```
State: [START] → [ORDER_CREATED] → [PAID] → [SHIPPED] → [COMPLETED]

Failure at step k ⇒ trigger Compensation[k…1]
```
Compensation: an idempotent inverse of each step:
- If shipping fails, run refundPayment(order_id).
- The compensation graph may contain cycles (compensations can be retried), so ensure idempotency.
This is eventual consistency: other transactions can observe intermediate states, and the system converges only once every step, or its compensation, has completed.
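A minimal orchestrator sketch, assuming each step exposes an action and an idempotent compensation (all names hypothetical; a real system also persists progress so it can resume after a crash):

```python
# Minimal saga orchestrator: run steps in order; on failure, run the
# compensations of already-completed steps in reverse.

def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):   # Compensation[k…1]
                comp()                    # must be idempotent: may be retried
            return False
    return True

# Usage (hypothetical service calls):
# run_saga([(create_order,   cancel_order),
#           (charge_payment, refund_payment),
#           (ship_order,     recall_shipment)])
```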
6. Advanced: Deterministic Ordering (Calvin)
Calvin’s workflow:
- Sequencer assigns a strict, global sequence number to every transaction Tᵢ → σᵢ
- Each node executes transactions in ascending σ order without further coordination
- Conflict detection happens ahead of execution, at the sequencing layer
Benefit: execution requires no cross-node locks or coordination rounds and stays local; per-transaction latency ≈ local compute + log shipping.
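A toy sketch of the scheme, assuming a single in-process sequencer and transactions modeled as functions over local state (all names illustrative):

```python
# Calvin-style deterministic execution in miniature: a sequencer assigns
# global sequence numbers; every node applies the same log in ascending σ
# order, so replicas converge without any runtime coordination.
import itertools

class Sequencer:
    def __init__(self):
        self._counter = itertools.count()

    def sequence(self, txn):
        return (next(self._counter), txn)   # (σᵢ, Tᵢ)

class Node:
    def __init__(self):
        self.state = {}

    def apply_log(self, log):
        for sigma, txn in sorted(log):      # ascending σ order
            txn(self.state)

seq = Sequencer()
log = [seq.sequence(lambda s: s.__setitem__("stock", 100)),
       seq.sequence(lambda s: s.__setitem__("stock", s["stock"] - 10))]
a, b = Node(), Node()
a.apply_log(log)
b.apply_log(log)
assert a.state == b.state == {"stock": 90}   # identical on every replica
```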
7. Hybrid Logical Clocks (HLC) & Bounded Time
Real-time clocks drift, so wall-clock timestamps alone cannot safely order events across machines; enter HLC, which pairs a physical timestamp with a logical counter.
Use HLC to assign commit timestamps, enabling:
- Monotonic ordering
- Bounded skew guarantees w/o specialized hardware
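A sketch of the standard HLC update rules (after Kulkarni et al.): a timestamp is a pair (l, c), where l tracks the highest wall-clock time seen and c is a logical counter that breaks ties:

```python
# HLC sketch: timestamps stay monotonic and close to physical time even
# when the local clock stalls or remote clocks run ahead.
import time

class HLC:
    def __init__(self):
        self.l = 0   # highest wall-clock time observed so far
        self.c = 0   # logical counter for events within the same l

    def now(self):
        """Local or send event."""
        pt = time.time_ns()
        if pt > self.l:
            self.l, self.c = pt, 0   # physical clock moved forward
        else:
            self.c += 1              # clock stalled: bump the counter
        return (self.l, self.c)

    def observe(self, lm, cm):
        """Receive event carrying remote timestamp (lm, cm)."""
        pt = time.time_ns()
        prev = self.l
        self.l = max(prev, lm, pt)
        if self.l == prev and self.l == lm:
            self.c = max(self.c, cm) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == lm:
            self.c = cm + 1
        else:
            self.c = 0               # fresh physical time dominates
        return (self.l, self.c)
```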
Spanner’s TrueTime takes a hardware route to the same goal: GPS receivers and atomic clocks yield a bounded clock-uncertainty interval ε, and commits wait out ε before becoming visible.
8. Latency vs. Consistency Trade-off
Let L = one-way network latency, σ = clock uncertainty, and f = failure probability (which drives retry and recovery costs).
Protocol | Commit Latency | Availability under Partition
---|---|---
2PC | ≈ 2L + O(fsyncs, log writes) | Blocks on coordinator failure
3PC | ≈ 3L + O(fsyncs, log writes) | Reduced blocking
Spanner | ≈ 2L + σ + O(log N) | CP choice
Saga (async) | ≈ L per local step (best-effort) + retries | AP choice
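To put illustrative numbers on this (all values hypothetical): with L = 50 ms between regions and σ = 5 ms of clock uncertainty, 2PC commits in ≈ 100 ms plus fsyncs, 3PC in ≈ 150 ms, Spanner in ≈ 105 ms plus replication overhead, while a saga acknowledges the local step in ≈ 50 ms and pushes the remaining work into asynchronous retries. The absolute numbers are made up; the relative ordering is what the table is arguing.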
9. Recommendations for Senior Engineers
- Monolith → Regional: start with single-region ACID.
- Global Strong Consistency: use Raft/Paxos + 2PC + HLC/TrueTime.
- High-Throughput Eventual Consistency: employ Saga + message broker (Kafka/RocketMQ) + idempotent compensations.
- Deterministic Execution: consider Calvin if you can centralize sequencing.
- Failure Testing: inject network partitions, crash coordinators, and verify recovery paths.
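As a taste of that last bullet, here is a sketch reusing the 2PC code from section 2: crash the coordinator between phases (the crash point is illustrative) and assert that participants are stuck in the blocking window:

```python
# Failure-injection sketch: coordinator dies between Phase 1 and Phase 2.
# Reuses Participant and Vote from the 2PC sketch in section 2.

def two_phase_commit_crashy(gid, participants, crash_after_prepare=False):
    votes = [p.prepare(gid) for p in participants]
    if crash_after_prepare:
        raise RuntimeError("coordinator crashed between phases")
    commit = all(v == Vote.YES for v in votes)
    for p in participants:
        p.decide(gid, commit)
    return commit

ps = [Participant("orders"), Participant("inventory")]
try:
    two_phase_commit_crashy("G1", ps, crash_after_prepare=True)
except RuntimeError:
    # Every participant voted but holds no decision: the 2PC blocking window.
    assert all(p.log == [("UNDO/REDO", "G1")] for p in ps)
```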