Distributed Systems
Byzantine Fault Tolerance
Цели урока
- Understand the difference between Byzantine and crash faults and why Raft/Paxos do not protect against the former
- Know the 3f+1 theorem and be able to calculate the required number of nodes
- Understand the three phases of PBFT and the purpose of each
- Know where BFT is applied in practice and when it is unnecessary
Предварительные знания
- How Raft/Paxos work (crash fault tolerance)
- Concept of quorum and majority voting
- Basic understanding of cryptographic signatures
Ethereum DAO hack 2016: USD 60M stolen not through a network breach, but through Byzantine contract behavior - nodes did exactly what the code said, yet the outcome was catastrophic. Paxos/Raft are powerless when nodes lie.
- **Bitcoin (PoW)** - probabilistic BFT for a public trustless network: 51% honest nodes guarantee consensus
- **Hyperledger Fabric** - PBFT for interbank settlements where participants are competitors
- **Boeing 777** - triple-modular redundancy for radiation-induced bit flips (hardware Byzantine)
- **SpaceX Falcon 9** - 3 Flight Management Computers with Byzantine-tolerant voting
- **Facebook Diem/Libra** - HotStuff BFT as the foundation of LibraBFT with O(N) view change
Lamport and the problem of traitors
In 1982, Lamport, Shostak and Pease published 'Byzantine Generals Problem' - the name was invented by Lamport as an analogy for a hard-to-explain problem. Before that it was called the 'Interactive Consistency Problem', which was far less memorable. The paper proved so influential that the term 'Byzantine fault' entered the standard vocabulary of distributed systems and is now used in official aviation standards DO-178C.
The Byzantine Generals Problem
**June 2016. The Ethereum DAO smart contract holds USD 150M. An attacker finds a vulnerability and recursively calls withdraw 1800 times - each time receiving 50 ETH before the balance reaches zero. USD 60M stolen.** Not a network bug - Ethereum nodes executed the code correctly. This is Byzantine fault at the logic level: the node did exactly what was written, but intentionally destructively.
Lamport, Shostak and Pease in 1982 formalized this problem through a military metaphor. N generals are besieging a city and must agree: ATTACK or RETREAT. Communication is only through messengers. Some generals are traitors who actively sabotage consensus.
**Byzantine fault** - a node behaves arbitrarily: lies about received messages, sends different data to different recipients, colludes with other traitors. The key difference from crash fault: a silent node is predictable, a lying node is not.
| Fault type | Node behavior | Protection algorithm |
|---|---|---|
| Crash (CFT) | Node crashed and silent - predictable | Raft, Paxos - needs 2f+1 nodes |
| Omission | Node silently drops some messages | Raft with timeouts |
| Byzantine (BFT) | Node lies, sends contradictions, colludes | PBFT, HotStuff - needs 3f+1 nodes |
**Paxos/Raft do not protect against Byzantine faults.** Both algorithms assume crash-stop model: a node either works correctly or goes silent. One malicious node can break a Raft cluster by responding with different data to different replicas and violating invariants.
The generals problem illustrates the core issue: with three generals and one traitor, the honest ones cannot reach consensus. If a traitor-commander sends general A the order ATTACK and general B the order RETREAT, A and B see a conflict but cannot tell who received the false order.
A large Raft cluster will protect against compromised nodes
Raft only protects against crash faults. One compromised node in a Raft cluster can violate safety.
Raft requires 2f+1 nodes for f crash failures. But the algorithm does not verify node honesty - it simply counts majority votes. A Byzantine node votes for incorrect values, and the quorum may accept a false decision.
What fundamentally distinguishes a Byzantine fault from a crash fault?
The 3f+1 Theorem: minimum nodes for Byzantine consensus
**Fundamental result by Lamport, Shostak and Pease (1982):** achieving Byzantine consensus with f malicious nodes requires at least 3f+1 total nodes. This is a mathematically proven lower bound - impossible to circumvent without additional assumptions.
**Intuition behind 3f+1:** with f Byzantine nodes and f silent nodes (simulating crash), f+1 honest votes remain. The quorum is 2f+1. Of those, at least f+1 are honest. This is enough to outvote f traitors. Total: f (Byzantine) + f (silent) + f+1 (honest) = 3f+1.
| f (traitors) | Nodes needed (3f+1) | Quorum (2f+1) | Min honest in quorum |
|---|---|---|---|
| 1 | 4 | 3 | 2 |
| 2 | 7 | 5 | 3 |
| 3 | 10 | 7 | 4 |
| 10 | 31 | 21 | 11 |
Why 3 nodes are insufficient for f=1: suppose the commander is a traitor. He sends general G1 the order ATTACK, general G2 the order RETREAT. G1 and G2 exchange messages. G1 tells G2: 'I received ATTACK'. G2 tells G1: 'I received RETREAT'. Both see a conflict, but both statements are equally credible - there is no way to determine which of the three (Commander, G1, G2) is the traitor.
With 4 nodes (3f+1 for f=1), the same scenario resolves: a quorum of 3 contains at least 2 honest nodes. Two honest nodes agree with each other, outvoting one traitor.
The 3f+1 theorem can be circumvented with clever cryptography or reliable hardware
3f+1 is an information-theoretic lower bound, independent of implementation
Lamport's proof works in a model without encryption (oral messages). With authentication (signed messages), the lower bound drops to 2f+1, but this is a different, stronger threat model.
A bank builds a transaction confirmation system with 7 nodes. How many Byzantine nodes can it tolerate?
PBFT: the first practical BFT algorithm
**Castro and Liskov, MIT, 1999. PBFT (Practical Byzantine Fault Tolerance) was the first algorithm to make BFT practically applicable.** Before PBFT, theoretical BFT protocols were too slow for real use - unlike Paxos. PBFT demonstrated only 3% overhead compared to non-BFT systems in small clusters.
Three phases of PBFT
**Pre-prepare:** Primary receives the client request, assigns sequence number n, and broadcasts to all replicas. **Prepare:** each replica receiving Pre-prepare broadcasts Prepare to all others. Prepared state is reached when 2f Prepare messages are collected. **Commit:** after prepared, each node sends Commit to all. After 2f+1 Commits - operation executes and client receives Reply.
**View Change:** if a replica does not receive Pre-prepare in time - it suspects the Primary. It sends VIEW-CHANGE with proof of prepared messages. The new Primary (view+1 mod N) collects 2f VIEW-CHANGE messages and broadcasts NEW-VIEW. Expensive: O(N^2) messages. A malicious Primary can intentionally trigger view changes to slow the system.
| Metric | Crash-tolerant (Raft) | BFT (PBFT) | BFT (HotStuff) |
|---|---|---|---|
| Nodes for f faults | 2f+1 | 3f+1 | 3f+1 |
| Messages per operation | O(N) | O(N^2) | O(N) |
| Rounds | 2 | 3 | 3 |
| Cryptography | Not needed | MAC/signatures | Threshold signatures |
| View change | O(N) | O(N^2) | O(N) |
Why does the PBFT client wait for f+1 identical replies rather than just one?
BFT in practice: when it is needed and what it costs
**HotStuff (Facebook/Meta, 2018) solved PBFT's main problem: O(N^2) messages during view change.** The key idea is threshold signatures: instead of the leader broadcasting N^2 messages, each replica sends its signature to the leader, the leader aggregates into one combined signature and does a single broadcast. View change: O(N) instead of O(N^2). HotStuff powers Diem/Libra and forms the basis of LibraBFT.
| Domain | Example | Why BFT, not Raft |
|---|---|---|
| Public blockchain | Bitcoin, Ethereum | Nodes belong to strangers, no trust |
| Permissioned blockchain | Hyperledger Fabric | Consortium of organizations - competitors in one network |
| Financial settlement | SWIFT, interbank | Protection against insiders and compromised nodes |
| Aviation/Space | Boeing 777, SpaceX | Radiation can flip bits - hardware Byzantine |
| Critical infrastructure | Power grids | APT attacks on control nodes |
**Most systems do not need BFT** - Raft or Paxos is sufficient. If all nodes are controlled by one organization - Raft is sufficient and orders of magnitude faster. BFT is needed when participants do not trust each other (different organizations, public networks) or when hardware failure can produce Byzantine behavior (space, nuclear).
Practical decision guide: internal systems (microservices, company databases) - use Raft or Paxos. Multi-org systems where participants compete or do not trust each other - use BFT. Public blockchains use probabilistic BFT (Bitcoin PoW, Ethereum PoS) which trades finality for scalability.
BFT is only needed for blockchains and cryptocurrencies
BFT is needed wherever nodes are controlled by different untrusting parties or where hardware can produce Byzantine failures
Boeing 777 avionics use triple-modular redundancy - classic BFT for hardware faults (radiation, bit flips). SpaceX Falcon 9 has 3 Flight Management Computers with Byzantine-tolerant majority voting. This predates blockchain by decades.
A startup builds a settlement system for 5 competing banks. Raft or PBFT?
Вопросы для размышления
- Most production systems use Raft/Paxos without BFT. What specific conditions in a company's architecture would signal the need to switch to a BFT protocol?
Связанные уроки
- dist-09-raft — Raft is the crash-fault baseline; BFT extends it
- dist-08-paxos — Classic Paxos assumes crash faults only
- ds-03-consensus — BFT consensus is a harder variant of the consensus problem
- dist-12-consistency — BFT changes the impossibility boundaries for consistency
- crypto-28-digital-signatures