ArchitectureStaff

Staff Prep 27: Distributed Systems — CAP Theorem, Eventual Consistency & Conflict Resolution

April 4, 202610 min readPART 05 / 06

Every engineer has heard of CAP theorem. Most recite it wrong. The honest version goes like this. During a network partition you have to choose between Consistency and Availability. You do not get both. Partition Tolerance is not a third option you can opt out of. You either handle partitions or your system falls over the first time the network sneezes. That's it. Everything else in this post is just the implications.

The real CAP theorem

Eric Brewer's original claim: in the presence of a network partition, a distributed system can guarantee either Consistency (every read returns the most recent write or an error) or Availability (every request receives a non-error response, though it may be stale). Not both.

The key phrase is "in the presence of a partition." Without a partition, you can have both. Most of the time your network is fine and this is not relevant. CAP only forces a trade-off when nodes cannot communicate.

What this means practically:

CP systems like Postgres, ZooKeeper and HBase refuse writes or return errors during a partition rather than serve stale data.
AP systems like Cassandra, DynamoDB and CouchDB keep serving requests during a partition, accepting that some responses may be stale.

Consistency models (there is a spectrum)

CAP's "consistency" means linearisability, the strongest model. In the real world there's a whole spectrum and most systems sit somewhere in the middle:

text

Strongest → Weakest

Linearisability  — reads always see latest write, globally ordered
Sequential       — operations appear in some global order (not necessarily real-time)
Causal           — causally related operations are seen in order
Read-your-writes — you always see your own writes
Eventual         — given no new writes, all replicas converge eventually

Postgres replication with synchronous_commit=on is linearisable within a single shard. The same Postgres with async replicas is read-your-writes on the primary and eventual on the replicas. Same database, different consistency guarantees, depending on one config flag.

Eventual consistency in practice

Eventual consistency means: if you stop writing, all replicas will eventually agree. It says nothing about how long that takes or what you see in the meantime.

The classic problem: two users update the same record on different nodes during a partition. When the partition heals, both updates exist. Who wins?

Three common resolution strategies:

text

1. Last Write Wins (LWW)
   — Use wall-clock timestamp. Highest timestamp wins.
   — Problem: clocks drift. You can lose writes.
   — Used by: Cassandra (default), DynamoDB

2. Version Vectors (Vector Clocks)
   — Each node tracks a counter for each other node.
   — [nodeA: 3, nodeB: 2] vs [nodeA: 2, nodeB: 3] = concurrent conflict
   — Surface the conflict to the application to resolve.
   — Used by: Riak, DynamoDB (conditional writes)

3. CRDTs (Conflict-free Replicated Data Types)
   — Data structures mathematically guaranteed to merge without conflicts.
   — Examples: G-Counter (increment only), OR-Set (add/remove set), LWW-Register
   — Used by: Redis (some types), collaborative editors (Figma, Notion)

The PACELC extension

CAP only covers partition scenarios. PACELC adds: even when there is no partition (E), there is a trade-off between Latency (L) and Consistency (C).

Example: Postgres synchronous replication gives you consistency but adds write latency (must wait for replica ack). Async replication is faster but risks data loss if the primary fails before replication.

sql

-- Synchronous replication: consistent but slower
synchronous_commit = on        -- wait for replica WAL write
synchronous_standby_names = '*'

-- Asynchronous replication: faster but eventual
synchronous_commit = off       -- return success immediately
-- Risk: up to wal_writer_delay ms of data loss on crash

Practical patterns for distributed state

Sagas for distributed transactions: When you need multi-step operations across services with no distributed transaction support, use a saga. Each step has a compensating action. On failure, roll back by executing compensating actions in reverse.

python

# Choreography saga — services react to events
# Step 1: Order service creates order (PENDING)
# Step 2: Payment service charges card → emits PAYMENT_SUCCESS
# Step 3: Inventory service reserves stock → emits INVENTORY_RESERVED
# Step 4: Order service marks CONFIRMED

# On PAYMENT_FAILED: order service cancels → emits ORDER_CANCELLED
# On INVENTORY_FAILED: payment service refunds → emits PAYMENT_REFUNDED

Idempotency keys: When retrying operations across unreliable networks, use idempotency keys to deduplicate. The server stores the key and result — if the same key arrives again, return the stored result without re-executing.

python

@app.post("/payments")
async def create_payment(
    request: PaymentRequest,
    idempotency_key: str = Header(...)
):
    # Check if we've already processed this key
    existing = await redis.get(f"idem:{idempotency_key}")
    if existing:
        return json.loads(existing)

    result = await process_payment(request)
    await redis.setex(
        f"idem:{idempotency_key}",
        86400,  # 24h TTL
        json.dumps(result)
    )
    return result

When to choose CP vs AP

Use CP (strong consistency) when correctness is non-negotiable:

Financial transactions, inventory counts, seat booking
Auth tokens, permissions (never serve stale access control)
Leader election, distributed locks

Use AP (eventual consistency) when availability matters more than perfect accuracy:

Shopping cart (losing a cart item is bad; cart being unavailable is worse)
Social feed, like counts, view counters
Search indexes, recommendations, analytics

The move I've seen work best: put the parts that must be correct (payments, inventory, auth) on CP storage, and let everything else live on AP storage. Don't pay the consistency tax on your view counters.

Quiz: test your understanding

Before moving on, answer these in your head (or out loud):

Your e-commerce site uses DynamoDB (AP). A customer adds an item to their cart on the US-East node. A network partition occurs. They immediately check their cart on US-West — it is empty. Is this expected? How would you mitigate this?
You are designing a distributed counter for tracking API rate limits. You need the count to be accurate within 5%. Would you choose a CP or AP approach? What data structure would you use?
Explain the difference between Last Write Wins and vector clocks. When does LWW silently lose data? Give a concrete example.
Your payment service calls three downstream services in sequence. The third call fails after the first two succeed. How do you ensure the system returns to a consistent state? What information do you need to store, and where?
A colleague proposes using Redis for session storage, saying "it is fast and available." You know Redis uses async replication by default. What specific failure scenario should you warn them about, and how would you mitigate it?

Next up: Part 28: Observability. Metrics, logs and traces, and how to actually use them to find production problems.

← PREV

Staff Prep 26: Load Balancing Strategies — L4 vs L7, Health Checks & Consistent Hashing