Parallel Computing

Deadlock, Livelock, Starvation

July 1997. NASA's Mars Pathfinder lander touches down with the Sojourner rover. Days into the mission, the spacecraft begins resetting itself repeatedly. Diagnosis from Earth takes days. Root cause: Priority Inversion - a high-priority information bus task waits on a mutex held by a low-priority meteorological task, while medium-priority work keeps preempting the low one. The system spins. Fix: enable Priority Inheritance in the VxWorks RTOS. One scheduler flag, patched over 190 million kilometers of empty space. Three concurrent tasks. One missing synchronization property. A $150M mission on the line.

  • **PostgreSQL**: builds a wait-for graph across transactions, checks for cycles every `deadlock_timeout` (default 1s), and rolls back one victim automatically
  • **Bank transfer locks**: lock-ordering by account ID is the industry standard - always lock smaller ID first to break circular wait
  • **Go channels**: the runtime detects all-goroutine deadlock at startup and panics with a full goroutine trace
  • **Java**: `ThreadMXBean.findDeadlockedThreads()` surfaces deadlocked threads at runtime for monitoring

Five philosophers, five forks, one table

In 1965 Edsger Dijkstra formulated the Dining Philosophers problem: five philosophers sit at a round table with five forks between them. To eat, each needs both the left and right fork. If all five pick up their left fork simultaneously - deadlock. The problem became the canonical benchmark for synchronization algorithms. In 1971 Dijkstra proposed a solution through ordered resource acquisition - always pick up the lower-numbered fork first. Every modern deadlock prevention algorithm traces back to that 1965 formulation. The four necessary conditions for deadlock (Mutual Exclusion, Hold and Wait, No Preemption, Circular Wait) were formally characterized by Coffman, Elphick, and Shoshani in the same year.

Deadlock: the four Coffman conditions

**Deadlock** is mutual blockage - two or more threads each hold a resource the other needs, and none can proceed. The system is stuck, not crashed. CPU usage drops to zero on the deadlocked threads, which is often the first observable symptom.

**Wait-for Graph:** Deadlock equals a cycle in the waiting graph. Nodes are threads; edges mean 'waiting for a resource held by'. Deadlock detection is just cycle detection on this graph. PostgreSQL builds exactly this graph for transactions and checks for cycles once per second (controlled by `deadlock_timeout`, default 1s).

Thread T1 holds resource R1 and waits for R2. Thread T2 holds R2 and waits for R1. All four Coffman conditions hold. Which condition is easiest to break here?

Prevention and detection

Three strategies for living with deadlock: **Prevention** (break a Coffman condition structurally), **Avoidance** (Banker's Algorithm - check safe state before each allocation), **Detection + Recovery** (let deadlocks happen, then fix them).

**Banker's Algorithm (Avoidance):** Before allocating a resource, checks whether the system remains in a 'safe state' - meaning every thread can still finish. Theoretically optimal, but requires knowing each thread's maximum resource needs upfront. This is impractical for most real systems, so databases prefer detection + recovery.

PostgreSQL detects a deadlock between transactions T1 and T2. How does it resolve the situation?

Livelock: progress without motion

**Livelock** is the opposite failure mode from deadlock: threads are active and responding to each other, but the overall state never advances. CPU usage is high. No thread is blocked. And yet nothing gets done.

**Ethernet CSMA/CD** is a classic livelock prevention mechanism: when two stations collide, each waits a random interval before retrying (Binary Exponential Backoff). Without randomization, they would collide again on every retry, indefinitely.

Two HTTP clients simultaneously try to create a resource. On conflict, each retries immediately. What happens?

Starvation: the indefinite wait

**Starvation** is a scheduling fairness failure: a thread never gets the resource it needs because higher-priority competitors always arrive first. Unlike deadlock, the system works correctly for everyone else. One thread just waits forever.

**Read-Write Lock and writer starvation:** With many concurrent readers and a constant stream of new arrivals, a writer can wait indefinitely. Standard fix: once a writer starts waiting, new readers are blocked. This breaks read preference in favor of fairness.

Deadlock and starvation are the same thing - a thread not getting a resource

Deadlock is mutual blockage via circular wait - all participants are stuck. Starvation is a scheduling fairness failure - the system works, one thread is systematically skipped.

In deadlock, the system itself is caught in a cycle; no one proceeds. In starvation, other threads proceed just fine - one particular thread never gets scheduled. Different causes, different fixes: deadlock needs lock ordering or detection+recovery; starvation needs fair scheduling with aging.

In a Priority Scheduling system, a low-priority thread (priority=10) has been waiting for CPU for 10 minutes. Which mechanism prevents starvation?

Key Ideas

  • **Deadlock** requires all four Coffman conditions. Break any one - lock ordering (Circular Wait) is the most practical.
  • **Prevention**: lock ordering. **Detection**: cycle in wait-for graph. **Recovery**: roll back one victim (PostgreSQL's approach).
  • **Livelock**: threads are active but the state never advances. Fix: randomize retry intervals with exponential backoff + jitter.
  • **Starvation**: one thread never gets the resource because higher-priority work always arrives. Fix: aging - raise effective priority over wait time.
  • **Deadlock vs starvation**: deadlock blocks all participants in a cycle. Starvation is an unfairness problem - the system runs, one thread is skipped.

What's next

Deadlock is the failure mode of lock-based concurrency. Lock-free approaches eliminate it structurally.

  • Lock-Free Data Structures — Atomic operations and CAS - how to build concurrent data structures without locks, eliminating deadlock by construction
  • OS Deadlocks — The same four Coffman conditions applied at OS process level - resource allocation graphs and system-level recovery

Вопросы для размышления

  • Why does detection + recovery often beat prevention in databases, even though prevention sounds safer?
  • Livelock is hard to diagnose - CPU usage is high but no work gets done. How would one distinguish livelock from genuine load in a production system?
  • Priority Inheritance fixes Priority Inversion but introduces propagation chains. When is Priority Ceiling the better choice?

Связанные уроки

  • os-05-sync — Mutexes and semaphores are the primitives that deadlock operates on
  • os-06-deadlocks — OS-level deadlock detection and parallel-computing deadlock are the same phenomenon at different abstraction layers
  • par-03 — Lock primitives covered in par-03 are the mechanism through which deadlocks form
  • par-05 — Lock-free and wait-free algorithms are the advanced answer to all three problems
  • os-04-scheduling — Starvation is a scheduling fairness problem - aging fixes it at the scheduler level
Deadlock, Livelock, Starvation

0

1

Sign In