CS 4536: Distributed Systems

Five recurring themes in distributed systems:

communication
coordination
naming
consistency and replication
fault tolerance

We will covering these by walking through MapReduce, GFS, Raft, Spanner, and Memcached.

The eight fallacies

A compact checklist for programmers to avoid making assumptions that are not true in a distributed system:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn’t change
There is one administrator
Transport cost is zero
Transports are homogeneous

Three questions for designing distributed systems

Local observation: what this node have actually seen?
Hidden remote history: what else could have happened globally?
Model assumption: what assumptions are we making about the system (nodes, time, network)?

System model

A system model states assumptions about the network, the nodes, and timing. A distributed algorithm is correct only relative to that model. If the environment violates those assumptions, the algorithm may no longer be correct.

Network behavior

Model	Meaning
Reliable links	A message is received iff it was sent. Messages may still be reordered
Fair-loss links	Messages may be lost, duplicated, or reordered, but repeated retransmission can eventually succeed
Arbitrary links	An activate adversary may eavesdrop, modify, drop, spoof, or replay messages

A network partition is a period when some nodes cannot exchange messages at all, or only with extreme delay, even though the nodes themselves are still running.

Node behavavior

Crash-stop: a node halts and never returns.
Crash-recovery: a node crashes and later restarts, often losing volatile state but keeping durable state.
Byzantine: a node may behave arbitrarily.

A node that is not faulty is called correct

Timing behavior

Synchronous: known upper bounds on message delay and processing time.
Asynchronous: no timing bounds at all.
Partially synchronous: the system behaves synchronously often enough, or after some unknown point, but not always.

Here, “synchronous” means timing guarantees, not a blocking API call.

In an asynchronous system, you cannot reliably distinguish slow from failed.

Summary

Dimension	What we often assume in practice
Network	Often fair-loss links engineered toward reliable behavior with retry, acknowledgement, deduplication, and transient-partition assumptions.
Node	Usually crash-stop or crash-recovery; Byzantine is the stronger model for adversaries or arbitrary faults.
Timing	Usually partially synchronous: delays and pauses can spike, but not forever.

When an algorithm claims a guarantee, ask: under which system model?