Systems Theory
Resilience: Why Some Systems Survive and Others Collapse
Yellowstone's forests recovered after wildfires burned about 36% of the park in 1988. Lehman Brothers collapsed within a week in 2008 and never came back. Why did one giant survive while the other disappeared? The answer lies in the concept of resilience.
- **Business**: Companies that survived COVID had different resilience strategies
- **Technology**: Netflix has ridden out major AWS outages, helped by its practice of Chaos Engineering
- **Finance**: An emergency fund is literally resilience in monetary form
- **Health**: The immune system is biological resilience
Resilience ≠ Stability
**Yellowstone, 1988**. Massive wildfires burned about 36% of the park - roughly 800,000 acres (about 320,000 hectares). A disaster? Thirty years later, the forest had regrown and become **healthier**. Meanwhile, **Lehman Brothers** collapsed within a week in 2008 and never recovered.
**Stability** - a system's ability to RESIST change and remain in its current state.
**Resilience** - a system's ability to RECOVER after a shock and return to functioning.
**Oak vs Bamboo**: An oak is rigid, resists the wind - but snaps in a hurricane. Bamboo bends to the ground - but springs back up after the storm.
A company hadn't changed its business model in 20 years. A competitor appeared and took 50% of the market. The company couldn't adapt and shut down. This is a problem of...
Three Sources of Resilience
Where does a system's ability to recover come from?
**1. Diversity** - many different elements = many paths to functioning. An ecosystem with 100 species is more resilient than one with 10.
**2. Modularity** - the system consists of relatively independent parts. The failure of one part doesn't kill the whole system.
**3. Redundancy** - duplication of critical functions. Two engines on a plane. A financial emergency fund.
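The effect of redundancy can be put into numbers. The following is a minimal sketch, assuming every redundant copy fails independently and with the same probability (a strong assumption, since real failures are often correlated). The function name and the 1% failure rate are illustrative and do not come from the lesson.

```python
def system_failure_probability(component_failure_prob: float, copies: int) -> float:
    """Probability that all `copies` redundant components fail at once.

    Assumes independent failures, which real systems often violate.
    """
    if not 0.0 <= component_failure_prob <= 1.0:
        raise ValueError("component_failure_prob must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return component_failure_prob ** copies


# Hypothetical numbers: each component fails 1% of the time.
for copies in (1, 2, 3):
    print(copies, system_failure_probability(0.01, copies))
```

Under these assumptions each extra copy multiplies the chance of total failure by the single-component failure rate, which is why even one backup changes the picture so sharply.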
| Source | Example | How it works |
|---|---|---|
| Diversity | 5 products vs 1 | A drop in one market doesn't kill the company |
| Modularity | Microservices vs monolith | A service failure doesn't take down the system |
| Redundancy | 6 months of expenses in the bank | Losing a job isn't a catastrophe |
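The diversity row can be illustrated with simple arithmetic. This is a rough sketch, assuming a shock hits exactly one product line and leaves the others untouched; the revenue figures and the function name `revenue_after_shock` are hypothetical.

```python
def revenue_after_shock(revenues: list[float], hit_index: int, drop: float) -> float:
    """Total revenue after one product line loses the fraction `drop` of its revenue."""
    if not 0.0 <= drop <= 1.0:
        raise ValueError("drop must be between 0 and 1")
    return sum(
        revenue * (1.0 - drop) if index == hit_index else revenue
        for index, revenue in enumerate(revenues)
    )


single_product = [100.0]        # all revenue from one product
five_products = [20.0] * 5      # the same total spread over five products

# One market loses half of its revenue in both cases.
for name, portfolio in (("1 product", single_product), ("5 products", five_products)):
    retained = revenue_after_shock(portfolio, hit_index=0, drop=0.5) / sum(portfolio)
    print(f"{name}: {retained:.0%} of revenue retained")
```

With these made-up numbers the single-product company keeps 50% of its revenue, while the diversified one keeps 90%. The benefit shrinks if the markets tend to fall together.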
Which source of resilience helped Yellowstone's forests recover after the wildfires?
Signs of Fragility
How can one tell when a system is becoming fragile?
- **Over-optimization** - 'cut everything unnecessary' = cut the buffers
- **Hidden interdependence** - 'Who knew THAT depended on THIS?'
- **Loss of diversity** - one product, one supplier, one technology
- **Long period without stress** - '10 years without failures' = nobody knows how to respond
**Lehman Brothers - all four signs:**
| Sign | At Lehman Brothers |
|---|---|
| Over-optimization | Leverage 30:1, minimal reserves |
| Hidden connections | Derivatives linked everyone to everyone |
| No diversity | Everyone held similar mortgage-backed securities |
| No stress | 25 years of market growth, 'prices only go up' |
Lehman didn't die from a single blow - it was already dead, it just didn't know it yet.
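The leverage figure shows how thin the buffer was. Here is a small sketch, assuming "30:1" means total assets divided by equity (the lesson does not define the ratio) and ignoring everything else on the balance sheet; the function name is illustrative.

```python
def loss_that_wipes_out_equity(leverage: float) -> float:
    """Fractional fall in asset value that reduces equity to zero.

    `leverage` is read as total assets / equity (an assumption).
    """
    if leverage < 1:
        raise ValueError("leverage must be at least 1")
    return 1.0 / leverage


for leverage in (2, 10, 30):
    print(f"{leverage}:1 -> {loss_that_wipes_out_equity(leverage):.1%}")
```

Under this reading, a firm at 2:1 can absorb a 50% fall in asset values, while a firm at 30:1 is wiped out by a fall of about 3.3%.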
**Objection**: Resilience is expensive and inefficient. Better to optimize.
**Response**: Resilience is insurance. The cost of fragility during a crisis is usually higher than the cost of maintaining reserves.
Lehman was optimized to the limit. It filed for bankruptcy with more than $600 billion in assets, the largest bankruptcy in U.S. history. Netflix pays for redundancy and has stayed online through major AWS outages.
A company is proud that it 'optimized away everything unnecessary' and runs on zero inventory. This is...
Designing for Resilience
Resilience doesn't happen on its own - it must be intentionally designed:
- **Embrace redundancy** - yes, it's 'inefficient'. That's the price of resilience.
- **Loose coupling** - modules can fail independently. Clear boundaries.
- **Safe-to-fail experiments** - small probes instead of big bets.
- **Feedback loops** - learn about problems quickly. Monitoring, alerts.
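Loose coupling and feedback loops can be sketched in a few lines. The example below is illustrative only: it assumes each module is a plain function, and the module names (`recommendations`, `playback`) are hypothetical. A failure is contained at the module boundary and recorded rather than hidden.

```python
from typing import Callable


def run_isolated(
    modules: dict[str, Callable[[], str]],
) -> tuple[dict[str, str], dict[str, str]]:
    """Run each module independently so one failure does not stop the others."""
    results: dict[str, str] = {}
    failures: dict[str, str] = {}
    for name, module in modules.items():
        try:
            results[name] = module()
        except Exception as exc:
            # Boundary: the failure stays inside this module.
            # Feedback: it is recorded so someone learns about it quickly.
            failures[name] = repr(exc)
    return results, failures


def recommendations() -> str:
    raise RuntimeError("recommendation service is down")


def playback() -> str:
    return "video stream ok"


results, failures = run_isolated(
    {"recommendations": recommendations, "playback": playback}
)
print("working:", results)
print("failed:", failures)
```

In a tightly coupled version the first exception would abort the whole run; here the core function keeps working and the failure shows up in the `failures` report.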
**Netflix Chaos Engineering**: Netflix deliberately 'kills' parts of its own infrastructure in production. Its Chaos Monkey tool randomly terminates running instances.
**Result**: Teams are FORCED to design for failures. When an AWS outage actually hits, Netflix is far more likely to keep running than services that never rehearsed failure.
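The idea behind this kind of experiment can be shown with a toy simulation. This is not Netflix's tool or its API, only a sketch under simple assumptions: a service counts as up while it has at least one running instance, and each round terminates exactly one random instance. The service names and replica counts are made up.

```python
import random


def chaos_round(replicas: dict[str, int], rng: random.Random) -> dict[str, int]:
    """Terminate one randomly chosen running instance; return the new replica counts."""
    running = [name for name, count in replicas.items() if count > 0]
    if not running:
        return dict(replicas)
    victim = rng.choice(running)
    after = dict(replicas)
    after[victim] -= 1
    return after


def services_down(replicas: dict[str, int]) -> list[str]:
    """Names of services with no running instance left."""
    return sorted(name for name, count in replicas.items() if count == 0)


rng = random.Random(42)  # fixed seed so the experiment is repeatable
fragile = {"api": 1, "billing": 1}     # no spare instances
redundant = {"api": 3, "billing": 3}   # spare instances for every service

print("fragile:", services_down(chaos_round(fragile, rng)))
print("redundant:", services_down(chaos_round(redundant, rng)))
```

In the fragile setup any single termination takes a service down; in the redundant setup the same experiment leaves every service running. Running such rounds regularly is what exposes a missing spare before a real outage does.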
**Antifragility (Taleb)**: Some systems actually get STRONGER from stress. Like muscles under load. Controlled stress strengthens.
Why does Netflix deliberately 'kill' its own services in production?
Key Ideas
- **Resilience ≠ stability**: recovery vs resistance to change
- **Three sources**: diversity, modularity, redundancy
- **Signs of fragility**: over-optimization, hidden connections, monoculture, absence of stress
- **Design principles**: redundancy, loose coupling, safe-to-fail, feedback loops
- **Antifragility**: controlled stress strengthens the system
What's Next?
We've studied the core concepts of systems. Now let's apply them to real-world systems.
- Ecosystems — Natural systems as a model of resilience
- Economic Systems — Markets, crises, antifragility
- Social Systems — Society as a complex adaptive system
Questions for Reflection
- What are the structural sources of resilience in a complex system? Where are signs of fragility most likely to appear?
- What categories of reserves determine system resilience: capital, capabilities, or relationships?
- What design changes increase resilience in a system at high load?
Related Lessons
- st-07-adaptation — Adaptation determines how a system responds to shocks
- st-01-feedback-loops — Feedback loops are the recovery mechanism
- st-09-ecology — Ecosystems are a living model of resilience through diversity
- st-04-leverage — Leverage points show where to build reserves
- prob-04-bayes — Bayesian updating mirrors how a system adapts after a shock
- dyn-08