Real-Time Systems
What Are Real-Time Systems
June 4, 1996. Ariane 5 launches. Thirty-seven seconds later it explodes. Cause: integer overflow in the inertial reference system copied from Ariane 4. The system encountered a flight profile it was never designed for - and instead of graceful degradation, threw an unhandled exception. Cost: 370 million dollars. Boeing 737 MAX MCAS - a real-time system with a priority logic bug: one angle-of-attack sensor lied, the system believed it and overrode the pilots. 346 lives lost. The difference between "fast" and "real-time" is not speed - it is a guaranteed upper bound. ABS stops a wheel lockup in 5-7 ms not because it is fast, but because it always fits within 5-7 ms. Miss the deadline and the car does not stop.
- **Ariane 5, 1996:** 370 million dollars lost in 37 seconds to an unhandled integer overflow in the inertial reference system (not a missed deadline). The same code worked on Ariane 4 - but Ariane 5 had a different flight profile, and a 64-bit floating-point value no longer fit into the 16-bit signed integer (int16) it was converted to
- **ABS in every car:** 5-7 ms from wheel lock detection to brake pressure release - hard deadline. Stack: C on bare metal or a small RTOS. No Java, no GC, no malloc inside the control loop
- **HFT (High-Frequency Trading):** an order arriving at 101 µs instead of 100 goes to /dev/null - the price has moved. Firm RT. This is exactly why exchange servers live in the same building as the exchange (colocation) and connect via dedicated fiber
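The Ariane 5 overflow described above is a narrowing conversion into a 16-bit integer that was never range-checked. The flight software was written in Ada, so the C sketch below is only an illustration of the idea, not the original code: the helper name `to_int16_saturating` and the choice to clamp (rather than, say, switch to a backup value) are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: converts a double to int16_t without overflowing.
 * Returns true if the value fit as-is, false if it had to be clamped
 * (NaN is mapped to 0 and also reported as false). */
bool to_int16_saturating(double value, int16_t *out) {
    if (value != value) {            /* NaN is the only value not equal to itself */
        *out = 0;
        return false;
    }
    if (value > (double)INT16_MAX) { /* too large: clamp to +32767 */
        *out = INT16_MAX;
        return false;
    }
    if (value < (double)INT16_MIN) { /* too small: clamp to -32768 */
        *out = INT16_MIN;
        return false;
    }
    *out = (int16_t)value;           /* in range: truncates toward zero */
    return true;
}
```

The caller gets both a usable value and a flag, so it can degrade gracefully (log, fall back, ignore the sample) instead of letting an exception take down the whole unit.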
Hard Real-Time: the deadline is law
120 km/h, wet asphalt. A wheel locks up. ABS must release the brake cylinder pressure within **5-7 milliseconds** - or the car loses traction. Not "preferably," not "ideally" - **required**. This is **hard real-time**: a missed deadline is a system failure, regardless of whether the computation was correct.
| System | Deadline | Consequences of violation |
|---|---|---|
| ABS brakes | 5-7 ms | Skid, accident |
| Pacemaker | 10 ms | Cardiac arrest |
| Airbag | 15 ms | Will not deploy in time |
| Nuclear reactor control | 100 ms | Core meltdown |
| Aircraft autopilot | 50 ms | Loss of control |
In **hard real-time** systems, correctness is defined not only by whether the result is right, but also by the **time** at which it is produced. A correct answer delivered after the deadline is treated as an error - the same as a wrong answer.
This is why ABS controllers are written in C, not Java: a 50 ms GC pause at the wrong moment misses the deadline. Pacemakers run C and Ada. The Boeing 737 MAX shows the other half of the problem: MCAS had a priority-logic bug and trusted a single faulty angle-of-attack sensor over the pilots. Hard RT does not forgive logic errors any more than it forgives late responses.
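The hard RT rule, "a correct answer after the deadline is an error", can be written down directly. This is a minimal sketch under stated assumptions: the names `rt_status`, `hard_rt_check` and `hard_rt_failed` are hypothetical, timestamps are microseconds from a monotonic clock, and `finish_us` is never earlier than `start_us`.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    RT_OK,               /* correct and on time */
    RT_WRONG_RESULT,     /* on time, but the value is wrong */
    RT_DEADLINE_MISSED   /* too late; correctness no longer matters */
} rt_status;

/* Classifies one job of a hard real-time task.
 * Assumes finish_us >= start_us (both from the same monotonic clock). */
rt_status hard_rt_check(bool result_correct,
                        uint64_t start_us,
                        uint64_t finish_us,
                        uint64_t deadline_us) {
    if (finish_us - start_us > deadline_us) {
        return RT_DEADLINE_MISSED;   /* checked first: lateness dominates */
    }
    return result_correct ? RT_OK : RT_WRONG_RESULT;
}

/* In hard RT a late result and a wrong result are both failures. */
bool hard_rt_failed(rt_status status) {
    return status != RT_OK;
}
```

With a 7 ms (7000 µs) budget, a correct result produced after 9000 µs is classified as `RT_DEADLINE_MISSED`, and `hard_rt_failed` treats it exactly like a wrong result.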
An airbag controller computed the deployment moment but was 50 ms late. What happens?
Soft Real-Time: quality degrades
YouTube, 60 FPS. One frame arrived 5 ms late - a brief stutter. Two in a row - noticeable jank. But the video keeps playing, the server did not crash, no one was harmed. This is **soft real-time**: a missed deadline degrades quality without breaking the system. The utility function - the value of the result - decreases past the deadline but does not collapse to zero.
| System | Deadline | On violation |
|---|---|---|
| Video call | 33 ms (30 FPS) | Lag, stuttering video |
| Online game | 16 ms (60 FPS) | Freeze, teleporting characters |
| Music streaming | 10-50 ms buffer | Audio stuttering |
| GPS navigation | 1 sec update | Stale position on map |
| VoIP telephony | 150 ms | Echo, speech delay |
In **soft real-time** there is a **utility function**: the later the answer, the less useful it is, but it still has some value. In hard RT the utility drops to zero (or below!) instantaneously at the deadline.
Online game: the server must process a player action within 50 ms. Processing took 200 ms. What happens?
Firm Real-Time: stale results go in the bin
An exchange. The algorithm made a trading decision in 95 µs - excellent. But execution arrived at 200 µs instead of 100 - the price moved, the arbitrage window closed. The result is not dangerous, not degraded - simply **useless**. The order goes to /dev/null. This is **firm real-time**: the value of a stale result is strictly zero.
| Type | Deadline violation | Example | Value of stale result |
|---|---|---|---|
| Hard RT | Catastrophe | ABS, pacemaker | Negative (dangerous!) |
| Firm RT | Result is useless | HFT, assembly robot | Zero (worthless) |
| Soft RT | Quality degradation | Video call, games | Positive, but diminished |
**Firm RT** is often confused with hard RT, but the difference is critical: in hard RT a late result is **dangerous** (an airbag after impact), while in firm RT it is simply **useless** (a trade order at a price that no longer exists). Firm RT tolerates missing some percentage of deadlines.
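The "value of stale result" column can be sketched as a single utility function. The numbers are illustrative assumptions, not part of any standard: on-time results are worth 1.0, a late hard RT result is given a penalty of -1.0, and soft RT value decays as 1 / (1 + lateness / half_value_ms), a curve chosen here only because it shrinks without ever reaching zero. The names `rt_class`, `result_utility` and `half_value_ms` are hypothetical.

```c
typedef enum { RT_HARD, RT_FIRM, RT_SOFT } rt_class;

/* Value of a result as a function of how late it is.
 * lateness_ms <= 0 means the deadline was met.
 * half_value_ms (soft RT only) is the lateness at which the value
 * has dropped to 0.5; it is assumed to be > 0. */
double result_utility(rt_class cls, double lateness_ms, double half_value_ms) {
    if (lateness_ms <= 0.0) {
        return 1.0;                  /* on time: full value for every class */
    }
    switch (cls) {
    case RT_HARD:
        return -1.0;                 /* late result is harmful */
    case RT_FIRM:
        return 0.0;                  /* late result is discarded */
    case RT_SOFT:
        return 1.0 / (1.0 + lateness_ms / half_value_ms);  /* diminished, still positive */
    }
    return 0.0;                      /* unreachable for valid rt_class values */
}
```

For a video frame that is 100 ms late with `half_value_ms = 100`, the soft RT value is 0.5; the same lateness yields 0.0 for a firm RT order and -1.0 for a hard RT actuator command.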
A Toyota robotic conveyor: a part passes the assembly point in 500 ms. Miss the window - the part is gone, the conveyor keeps moving, the next part arrives in a second. Each miss is a defect or lost throughput. Not an accident. Firm RT.
A weather station must transmit temperature data every 10 minutes. The transmission was 2 minutes late. What type of RT is this?
Determinism: predictability over speed
Ariane 5 exploded 37 seconds into flight. The inertial reference system worked perfectly - 99.9% of the time. But at that exact moment an integer overflowed, and instead of graceful degradation the system threw an unhandled exception. 0.1% worst-case destroyed 99.9% uptime. **In real-time there are no statistics - only worst-case.** An algorithm averaging 1 ms but occasionally hitting 100 ms is worse than one that always takes 10 ms.
| Source of non-determinism | Cause | RT solution |
|---|---|---|
| Garbage Collection | GC pause 10-200 ms | Use non-GC languages (C, Rust) |
| Virtual Memory (page faults) | Disk access: ~10 ms | Lock pages in RAM (mlock) |
| Cache misses | L3 miss: ~100 ns | Cache partitioning, prefetch |
| Dynamic memory (malloc) | Fragmentation, variable time | Static allocation at startup |
| Interrupts from other devices | Unpredictable timing | Dedicate cores to RT (CPU isolation) |
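The "static allocation at startup" row can be illustrated with a fixed-block pool: all memory is reserved at link time, and allocating or freeing a block is a constant-time pointer swap. This is a minimal sketch, not taken from any particular RTOS; the names (`pool_init`, `pool_alloc`, `pool_free`) and the sizes are hypothetical, and the code is not safe to call concurrently from an interrupt handler and the main loop without extra locking.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCK_SIZE  64   /* bytes per block (hypothetical sizing) */
#define POOL_BLOCK_COUNT 8    /* number of blocks (hypothetical sizing) */

typedef union pool_block {
    union pool_block *next;            /* link, used while the block is free */
    uint8_t payload[POOL_BLOCK_SIZE];  /* user data, used while allocated */
} pool_block;

static pool_block pool_storage[POOL_BLOCK_COUNT];  /* reserved at link time */
static pool_block *pool_free_list = NULL;

/* Call once at startup, before the control loop runs. */
void pool_init(void) {
    pool_free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* O(1): pops the head of the free list; returns NULL when the pool is empty. */
void *pool_alloc(void) {
    pool_block *block = pool_free_list;
    if (block != NULL) {
        pool_free_list = block->next;
    }
    return block;
}

/* O(1): pushes the block back. The pointer must come from pool_alloc. */
void pool_free(void *ptr) {
    pool_block *block = (pool_block *)ptr;
    block->next = pool_free_list;
    pool_free_list = block;
}
```

Unlike `malloc`, there is no search and no fragmentation: the worst-case cost of `pool_alloc` is the same few instructions on every call, and running out of blocks is an explicit `NULL` the caller has to handle.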
**RTOS** (Real-Time Operating System): VxWorks, FreeRTOS, QNX - operating systems designed for determinism. They guarantee a bounded interrupt latency (typically < 10 µs on suitable hardware). Standard Linux is not RT, but the **PREEMPT_RT** patch turns it into a soft RT system.
Core i9 server with a GC language vs a 2-dollar STM32 with bare-metal C. Server average latency: 0.3 ms. Worst-case: 80 ms (GC pause). STM32 average: 8 ms. Worst-case: 9 ms. For a 15 ms deadline the STM32 wins. Not because it is cheaper - because its behavior is provable. This is why pacemakers do not run Snapdragon.
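The comparison above boils down to judging a system by its worst case rather than its average. The sketch below summarizes a set of latency samples and checks them against a deadline; the names `latency_stats`, `summarize_latency` and `meets_deadline` are hypothetical. One important caveat: the largest *observed* latency is only a lower bound on the true worst case, so measurements like these can disqualify a system but cannot prove that it is safe. That takes static worst-case execution time analysis.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    double average_ms;  /* mean of the samples */
    double worst_ms;    /* largest observed sample */
} latency_stats;

/* Computes the mean and the maximum of n latency samples (in ms).
 * Returns zeros for an empty input. */
latency_stats summarize_latency(const double *samples_ms, size_t n) {
    latency_stats stats = {0.0, 0.0};
    if (n == 0) {
        return stats;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += samples_ms[i];
        if (samples_ms[i] > stats.worst_ms) {
            stats.worst_ms = samples_ms[i];
        }
    }
    stats.average_ms = sum / (double)n;
    return stats;
}

/* Real-time acceptance test: only the worst case counts. */
bool meets_deadline(latency_stats stats, double deadline_ms) {
    return stats.worst_ms <= deadline_ms;
}
```

Fed with samples such as `{0.3, 0.3, 0.3, 80.0}` for the server and `{8.0, 8.0, 9.0, 8.0}` for the microcontroller, `meets_deadline(..., 15.0)` rejects the server and accepts the microcontroller, regardless of which one is faster most of the time.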
**Myth:** Real-time = fast. The more powerful the processor, the more 'real-time' the system.
**Reality:** Real-time = predictable and on time. A slow but deterministic system beats a fast but unpredictable one.
'Real-time' is a guaranteed upper bound on response time, not an average speed. A JVM with GC can pause for up to 200 ms unpredictably. A 16 MHz STM32 with bare-metal code can have a provable worst case of 9 ms. Ariane 5 did not explode because its code was slow - it exploded because one edge case was not covered by worst-case analysis. In RT the only question is 'what happens in the worst case?', not 'what happens on average?'
For a hard RT robot controller: Linux (average 0.5 ms, worst-case 50 ms) or FreeRTOS (average 5 ms, worst-case 6 ms)?
Key Ideas
- **Real-time = predictably on time, not fast:** a 2-dollar STM32 beats a 5 GHz server when the deadline is 15 ms and the server has GC pauses up to 80 ms
- **Hard RT - zero tolerance:** ABS, pacemaker, nuclear reactor control. A correct answer after the deadline equals a wrong answer. Languages: C, Rust, Ada. No GC allowed
- **Soft RT - graceful degradation:** video calls, games, streaming. A dropped frame is a stutter, not a catastrophe. The utility function declines gradually past the deadline
- **Firm RT - stale is worthless:** HFT, robotic conveyors. A result after the deadline is neither dangerous nor degraded - it is simply discarded
Related Topics
RT systems draw on knowledge from several areas:
- Scheduling: Rate Monotonic — Algorithms for scheduling tasks with deadlines
- Worst-Case Execution Time — Worst-case execution time analysis for deadline guarantees
- Operating Systems — RTOS vs general-purpose OS - different design priorities
Questions for Reflection
- Why do automotive controllers use $2 microcontrollers rather than powerful $200 processors?
- Can a standard Linux system be turned into a hard RT system? What would need to change?
- Back to the opening: how does ABS guarantee a 5 ms response across all possible scenarios?
Related Lessons
- emb-01 — Real-time systems typically run on embedded hardware: interrupts, RTOS, constrained resources
- os-01-intro — RTOS is a specialization of OS: deterministic scheduler instead of fair-share, hard deadlines instead of throughput
- ds-01-intro — Distributed systems also require determinism and fault tolerance, but at network scale rather than a single device
- arch-01-binary — Understanding pipeline, cache, and CPU interrupts is critical for worst-case execution time analysis