Blockchain
The Consensus Problem: Why and How
Three people are deciding where to have dinner. One of them always tells different people different things. It seems like an innocent problem, but in 1982 Leslie Lamport proved that with three participants and one liar, agreement is mathematically impossible. It's not a matter of cleverness or algorithm - it's a mathematical law. That same law determines why a Bitcoin transaction takes 60 minutes to confirm, and why Cosmos Hub once fully halted for half an hour.
- **Bitcoin** processes $50B+ in transfers monthly without a single arbiter - Nakamoto consensus lets thousands of anonymous nodes agree on who owes what to whom
- **Cosmos Hub (Tendermint)** prefers halting over an error: in 2023 the network froze for 30 minutes during a validator failure - safety over liveness
- **Ethereum** after The Merge combines both approaches: keeps running during a fork (liveness), but after 2 epochs (~13 min) provides economic finality with billion-dollar penalties for reversal
Lamport and the Generals Who Never Were
Leslie Lamport is a scientist better known as the creator of LaTeX and Paxos. He formulated the Byzantine Generals Problem not out of interest in military history, but to describe failures in NASA computer systems: spacecraft on-board computers must keep running even if one module starts producing incorrect data. Lamport chose the Byzantine general metaphor because a "traitor" accurately describes a faulty computer's behavior: it doesn't just go silent - it actively sends false data, and different data to different recipients.
Without formalizing Byzantine fault tolerance, there would be no PBFT, no Tendermint, and no Bitcoin. Lamport gave us the language to describe the problem - and that's the first step toward solving it.
Предварительные знания
The Byzantine Generals Problem
Five armies surround a fortress. They can only attack **simultaneously** - otherwise they lose. The generals exchange messengers, but one general is a **traitor**. He sends some generals "attack" and others "retreat". How can the honest generals reach agreement?
This is the **Byzantine Generals Problem**, formulated by Leslie Lamport in 1982. It models a fundamental challenge: how to reach consensus in a network where **participants don't trust each other** and some may **actively sabotage** the process.
In a blockchain, the "generals" are nodes, the "messengers" are network messages, and the "traitors" are malicious or compromised nodes. Every time a Bitcoin node receives a block, it solves the problem: **is this block valid, or is the sending node trying to deceive the network?**
**Lamport's result:** the problem is solvable only if traitors are **strictly fewer than 1/3** of the total. With 3 generals and 1 traitor - **no solution exists**. This is a fundamental constraint, not a flaw of any particular algorithm.
In a network of 9 nodes, what is the maximum number that can be Byzantine (malicious) while consensus is still possible?
Crash Fault vs Byzantine Fault
Not all failures are equal. A server may **crash** (stop responding) - that's a Crash Fault. Or it may **lie** (send different answers to different nodes) - that's a Byzantine Fault. The difference determines how many nodes can fail.
| Crash Fault (CFT) | Byzantine Fault (BFT) | |
|---|---|---|
| Faulty node behavior | Silently stops responding | Can lie, send different data to different nodes |
| Threshold | N/2 - need more than half alive | N/3 - need more than 2/3 honest |
| Minimum nodes for 1 failure | 3 (2f+1, f=1) | 4 (3f+1, f=1) |
| Example algorithm | Raft, Paxos, ZooKeeper | PBFT, Tendermint, HotStuff |
| Where used | Inside a trusted datacenter | Blockchain, open networks |
Why does the threshold differ? With a **Crash Fault**, the crashed node simply stays silent - others notice and ignore it. With a **Byzantine Fault**, the malicious node actively deceives: sending node A one value and node B another. Neutralizing such an attack requires more "votes".
**CAP theorem** (Brewer, 2000) adds another constraint. A distributed system cannot simultaneously guarantee all three properties: - **Consistency** - all nodes see the same data - **Availability** - every request gets a response - **Partition tolerance** - the system works despite network splits
In real networks **partitions (splits) are inevitable** - cables break, datacenters lose connectivity. So in practice the choice is always between **CP** (consistency under partition) and **AP** (availability under partition). Bitcoin chose AP - the network keeps running during a split, but temporary forks may occur.
A company launches a private blockchain on 10 nodes in its own datacenters (trusted environment). What type of fault tolerance is sufficient?
Finality: When a Transaction Is Irreversible
After sending 1 BTC and seeing a confirmation, can it be certain the transaction will **never** be reversed? The answer depends on the type of **finality** - the guarantee of irreversibility.
There are two different approaches: **Probabilistic finality** - the probability of reversal decreases with every new block, but theoretically never reaches zero. Bitcoin works this way. **Deterministic (absolute) finality** - once confirmed, a transaction is irreversible **mathematically**. BFT protocols (Tendermint, HotStuff) provide this guarantee.
**Why exactly 6 confirmations in Bitcoin?** With 10% hashrate, the attacker's reversal probability after 6 blocks is less than 0.00002%. This means roughly 60 minutes of waiting (6 blocks × ~10 min). For large amounts, exchanges wait 12-60 confirmations.
| Property | Probabilistic (Bitcoin, Ethereum PoW) | Deterministic (Tendermint, HotStuff) |
|---|---|---|
| Guarantee | Reversal probability → 0, but never = 0 | After 2/3 votes - mathematically irreversible |
| Time to finality | ~60 min (6 blocks × 10 min) | ~1-6 seconds |
| Fork possible? | Yes (temporary fork is normal) | No (fork = protocol violation) |
| Throughput | ~7 tx/s (Bitcoin) | ~1,000-10,000 tx/s |
| Scalability | Thousands of nodes | Tens to hundreds of nodes (communication overhead) |
Real attack: double-spend on Bitcoin Gold (2018)
Probabilistic finality in practice
In May 2018, an attacker rented mining power and gained >51% of the Bitcoin Gold network hashrate. They: 1. Sent BTG to an exchange and received other crypto 2. Secretly mined an alternative chain 3. Published the longer chain - the network accepted it as canonical 4. The exchange transaction "disappeared" - a $18M double-spend Exchanges that waited for few confirmations suffered. Those requiring >20 confirmations did not.
**Ethereum** after The Merge (2022) switched to PoS with "economic finality": every 6.4 minutes (1 epoch) validators vote on blocks. After 2 epochs (~13 minutes) a block gains finality - reversing it would require destroying ETH worth billions (slashing).
A seller is accepting 5 BTC for a car. The buyer sent a transaction visible in the mempool (0 confirmations). Should the keys be handed over?
Liveness and Safety: The Impossible Tradeoff
Every consensus protocol balances two fundamental properties: **Safety**: all honest nodes agree on **the same** value. No contradictions, no fork. "Better to make no decision than a wrong one." **Liveness**: the system **keeps running** and processes new transactions. It doesn't halt or stall. "The system always makes progress."
In 1985, Fischer, Lynch, and Paterson proved **FLP impossibility** - one of the most important results in distributed systems: *In an asynchronous network (where message delivery time is not guaranteed), no deterministic consensus protocol can guarantee both safety and liveness if even one node can fail.*
**FLP impossibility** doesn't mean consensus is impossible! It means one of the requirements must be **relaxed**: allow probabilistic resolution (Bitcoin), introduce timeouts (PBFT), or assume partial synchrony (Tendermint).
| Protocol | Priority | Safety | Liveness | Tradeoff cost |
|---|---|---|---|---|
| Bitcoin (Nakamoto) | Liveness | Probabilistic - fork possible | Always running | Wait 6 blocks for finality |
| Tendermint (Cosmos) | Safety | Absolute - no fork | May halt if >1/3 offline | Network freezes if too many nodes offline |
| Ethereum PoS (Gasper) | Both (hybrid) | Eventual finality (2 epochs) | Continues during fork | Most complex protocol, attacks at boundaries |
Cosmos Hub halt (2023)
Safety > Liveness in practice
In June 2023, Cosmos Hub (Tendermint BFT) stopped for ~30 minutes due to an upgrade bug. More than 1/3 of validators crashed, and the network **stopped producing blocks**. This is Tendermint by design: better to halt (violate liveness) than allow a fork (violate safety). Nobody lost money, no double-spends. Compare to Bitcoin Gold: the network never halted, but an $18M double-spend happened - liveness was preserved, safety was violated.
**FLP impossibility** is not an abstract theorem. Every promise that "our protocol guarantees both safety and availability" hides assumptions. Usually it's "under partial synchrony" or "if fewer than 1/3 of nodes are offline".
Bitcoin doesn't solve the Byzantine Generals Problem because it allows temporary forks
Bitcoin solves the Byzantine Generals Problem **differently**: through probabilistic finality and economic incentives (PoW). A temporary fork is not a bug but a feature - the price of liveness in an open network without pre-registered participants.
Classical BFT (PBFT, Tendermint) assumes a fixed set of validators - everyone knows who participates. Bitcoin allows **anyone** to join (permissionless). For this, Satoshi replaced voting with computational work (PoW) and the longest chain rule - this is Nakamoto consensus, a new approach to solving the Byzantine Generals Problem that doesn't require a fixed set of participants.
In a Tendermint (BFT, safety-first) network of 100 validators, 35 lose connectivity. What happens?
Key Ideas
- **The Byzantine Generals Problem** - the fundamental problem of reaching agreement in the presence of traitors. A solution is possible only if honest participants are **more than 2/3** (for BFT) or **more than 1/2** (for CFT)
- **Crash Fault vs Byzantine Fault** - a crashed node (silent) vs a malicious node (lying). BFT is stricter: `3f+1` nodes for f failures, CFT is more lenient: `2f+1`
- **Finality** - either probabilistic (Bitcoin: reversal probability → 0 with each block) or deterministic (Tendermint: irreversible after 2/3+1 quorum)
- **Safety vs Liveness** - FLP impossibility: both properties cannot be guaranteed in an asynchronous network. Bitcoin chooses liveness (always running, but fork possible), Tendermint chooses safety (no fork, but may halt)
- Remember the three people and the restaurant from the beginning of the lesson? Three people, one liar - and Lamport proved agreement is impossible. Every blockchain protocol is an engineering answer to that mathematical prohibition: it cannot be circumvented, only traded off
Related Topics
Consensus is the foundation on which specific protocols are built:
- P2P Networks and Gossip Protocols — Consensus runs on top of P2P networks - gossip delivers messages between nodes
- Proof of Work — Nakamoto consensus - solving the Byzantine Generals Problem through computational work and the longest chain rule
- Proof of Stake — Alternative to PoW: economic stake instead of computation, economic finality via slashing
- BFT Protocols — PBFT, Tendermint, HotStuff - concrete implementations of Byzantine Fault Tolerance with deterministic finality
Вопросы для размышления
- Bitcoin has existed for 15+ years and has never violated safety (no double-spend in the main chain). Does this mean that probabilistic finality is in practice no worse than deterministic? Or have we simply not yet encountered a sufficiently motivated attacker?
- The CAP theorem forces a choice between consistency and availability during a network split. When designing an international bank transfer system - which property takes priority and why?
- FLP impossibility is proven for deterministic protocols. Bitcoin circumvents this via randomization (PoW - whoever finds the hash first wins). What other approaches to circumventing FLP exist?
Связанные уроки
- bc-04-p2p — Consensus is about nodes in a P2P network reaching agreement; the P2P network structure is the prerequisite
- bc-14-pow — Proof of Work is the first concrete consensus mechanism; this lesson provides the abstract foundation
- bc-15-pos — Proof of Stake is the second major consensus family and needs this abstract intro
- bc-16-bft — BFT consensus (PBFT, Tendermint) is the classical family; the intro lesson is the prerequisite
- dist-10-byzantine