Real-Time Systems
RT System Design: Automotive, Avionics, Medical, Industrial
In 1996 the Ariane 5 rocket exploded 37 seconds into its maiden flight - about half a billion dollars in damage. The cause: an overflow when a 64-bit floating-point value was converted to a 16-bit integer, in code reused from Ariane 4 without analyzing the difference in requirements. A decade earlier, in 1985-87, the Therac-25 machine killed three people through a race condition in mode-switching code. These cases became textbook examples for four industries - automotive, aviation, medical, industrial - each developing its own set of standards and architectural patterns. On the surface the worlds look separate, but behind the standards lie three shared ideas: temporal isolation, spatial isolation, and defence in depth.
- **Tesla**: Autopilot DAS3.0 uses a mixed-criticality architecture on two Tesla FSD chips with lockstep mode and a watchdog between them - certified to ISO 26262 ASIL D
- **Airbus A350**: the FCMC (Flight Control Module Computer) uses ARINC 653 partitioning on VxWorks 653; DAL A components passed formal verification
- **Medtronic infusion pumps**: an IEC 62304 Class C architecture with an independent safety MCU and a hardware watchdog reduced recall events by 60% versus the single-MCU predecessor
- **Siemens SIMATIC S7-1500**: a PLC with a 1 ms scan cycle and IEC 61508 SIL 3 certification, used in turbine and nuclear plant control
Automotive: AUTOSAR, CAN and ISO 26262
A modern car contains up to 150 ECUs (Electronic Control Units) connected by CAN, FlexRay, and Ethernet buses. The **AUTOSAR Classic** standard defines the architecture: a hard real-time runtime with static task scheduling and fixed memory allocation, on top of an OSEK/VDX-level RTOS. The brake system must react to a pedal press within 10 ms from sensor to actuator, and that is a budget, not an average. The safety standard **ISO 26262** defines four ASIL levels (A-D): the higher the risk level, the stricter the requirements on development, testing, and architecture. ASIL D covers functions whose failure can lead to casualties (electric power steering, brakes); it calls for the most rigorous verification methods and for redundancy.
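The difference between a budget and an average can be sketched as a worst-case sum over the sensor-to-actuator chain. The stage names and millisecond values below are illustrative assumptions, not figures from AUTOSAR or any real vehicle; only the 10 ms budget comes from the text.

```python
# Hypothetical worst-case latencies (ms) for each stage of a brake chain.
# Stage names and numbers are illustrative assumptions, not measured values.
BUDGET_MS = 10.0

stages_worst_case_ms = {
    "pedal_sensor_sampling": 1.0,
    "can_transmission": 2.0,
    "task_release_jitter": 1.0,
    "control_computation": 2.5,
    "actuator_command": 2.0,
}


def check_budget(stages, budget_ms):
    """Sum per-stage worst cases; a negative margin means the budget is violated."""
    total = sum(stages.values())
    return total, budget_ms - total


total_ms, margin_ms = check_budget(stages_worst_case_ms, BUDGET_MS)
print(f"worst case {total_ms} ms, margin {margin_ms} ms")
```

The check uses the worst case of every stage at once, which is pessimistic but is what a budget means: the chain must fit even when every stage is at its slowest.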
AUTOSAR Adaptive (since 2017) attempts to answer autonomous driving needs: a POSIX-compatible environment, dynamic allocation, containerization. Yet latency and certification requirements remain; running Linux PREEMPT_RT and calling it 'adaptive AUTOSAR' does not pass certification. In practice Adaptive is used for advanced ADAS functions and infotainment; ASIL D functions still live on Classic because its behavior is deterministic.
Why is AUTOSAR Classic still used for ASIL D functions even when the 'modern' AUTOSAR Adaptive exists?
Avionics: ARINC 653, DO-178C, partitioning
Aviation's primary architectural principle is **partitioning**: different programs share a processor while being strictly isolated in both time and memory. The **ARINC 653** standard defines an API for a real-time OS on onboard computers: a fixed Major Time Frame (for example 50 ms) divided into partition windows. Each partition receives a guaranteed slot - flight control 20 ms, navigation 10 ms, display 5 ms. Inside a partition the scheduling is conventional, but a partition cannot overrun its window: a hardware timer interrupt ends the window and forces the switch to the next partition. The result is **temporal isolation** between functions of different criticality.
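A minimal sketch of such a fixed window table, using the 50 ms frame and the three slot lengths from the paragraph above. The back-to-back layout from offset 0 and the function names are assumptions made for illustration; this is not the ARINC 653 API.

```python
MAJOR_FRAME_MS = 50

# (partition name, window length in ms), laid out back to back from offset 0.
WINDOWS = [("flight_control", 20), ("navigation", 10), ("display", 5)]


def build_schedule(windows, major_frame_ms):
    """Turn window lengths into (name, start, end) entries and check they fit."""
    schedule, offset = [], 0
    for name, length in windows:
        schedule.append((name, offset, offset + length))
        offset += length
    if offset > major_frame_ms:
        raise ValueError("partition windows exceed the major time frame")
    return schedule


def active_partition(schedule, t_ms, major_frame_ms):
    """Return the partition that owns the CPU at time t_ms, or None for spare time."""
    phase = t_ms % major_frame_ms  # the table repeats every major frame
    for name, start, end in schedule:
        if start <= phase < end:
            return name
    return None


SCHEDULE = build_schedule(WINDOWS, MAJOR_FRAME_MS)
print(active_partition(SCHEDULE, 25, MAJOR_FRAME_MS))
```

Which partition runs depends only on the clock, never on what the other partitions are doing. That is the property that separates this from priority-based preemption.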
Certification under **DO-178C** for DAL A (catastrophic outcome on failure) requires 100% MC/DC structural code coverage, formal requirements, an independent verification team, and bidirectional traceability between requirements, code, and tests. Development cost for DAL A code is $200-500 per line, against $10-50 for ordinary embedded C. At those rates, the ~25 million lines of code carried by the A350 help explain why it costs billions to develop.
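MC/DC asks that every condition in a decision be shown to affect the outcome independently. The sketch below illustrates the unique-cause form of that idea: for a hypothetical decision `a and (b or c)` it enumerates pairs of test vectors that differ in exactly one condition and flip the result. The decision and the function names are assumptions for illustration, not part of DO-178C or any coverage tool.

```python
from itertools import product


def decision(a, b, c):
    # Hypothetical decision under test.
    return a and (b or c)


def independence_pairs(func, n_conditions):
    """For each condition index, list pairs of test vectors that differ only in
    that condition and produce different decision outcomes."""
    pairs = {i: [] for i in range(n_conditions)}
    for vec in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vec[i]:
                continue  # visit each pair once, starting from the False side
            flipped = vec[:i] + (True,) + vec[i + 1:]
            if func(*vec) != func(*flipped):
                pairs[i].append((vec, flipped))
    return pairs


pairs = independence_pairs(decision, 3)
for index, found in pairs.items():
    print(index, found)
```

A test suite reaches MC/DC for this decision when it contains at least one such pair per condition. For n conditions that usually needs at least n + 1 test vectors, far fewer than the 2^n of exhaustive testing.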
How does temporal isolation in ARINC 653 differ from ordinary preemptive scheduling?
Medical devices: IEC 62304 and risk classification
Medical devices are classified by risk: **Class A** - failure causes no harm; **Class B** - may cause non-serious injury; **Class C** - may cause death or serious injury. The **IEC 62304** standard defines software development processes for each class. Class C covers infusion pumps, pacemakers, ventilators, and radiation therapy machines. The dominant architectural pattern is segregation of safety-critical logic from non-safety functions - either onto a separate microcontroller, or through RTOS partitioning analogous to avionics.
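The segregation pattern can be sketched as a small independent monitor that trusts neither the main controller's logic nor its liveness. The class name, the rate limit, and the timeout below are hypothetical. A real safety MCU would run comparable checks in its own firmware, on its own clock and sensor path.

```python
from dataclasses import dataclass


@dataclass
class SafetyMonitor:
    max_rate_ml_h: float        # hypothetical hard limit on the delivery rate
    heartbeat_timeout_ms: int   # hypothetical watchdog window
    last_heartbeat_ms: int = 0
    tripped: bool = False

    def heartbeat(self, now_ms):
        """Called by the main controller to prove it is still alive."""
        self.last_heartbeat_ms = now_ms

    def check(self, now_ms, measured_rate_ml_h):
        """Return True if the pump may keep running. A trip is latched."""
        if now_ms - self.last_heartbeat_ms > self.heartbeat_timeout_ms:
            self.tripped = True  # main controller went silent
        if measured_rate_ml_h > self.max_rate_ml_h:
            self.tripped = True  # independently measured rate exceeds the limit
        return not self.tripped


monitor = SafetyMonitor(max_rate_ml_h=500.0, heartbeat_timeout_ms=100)
monitor.heartbeat(0)
print(monitor.check(50, 120.0))
```

The trip is latched on purpose: once the monitor has seen an unsafe condition, a later heartbeat from a possibly faulty main controller does not re-enable delivery.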
The textbook example that drove standards toward stricter rules is **Therac-25** (1985-87): a radiation therapy machine where a race condition in the mode-switching code caused massive radiation overdoses in six known accidents. Three deaths. Root cause: a shared state variable was updated without checking that the hardware had actually reached the selected mode, so an operator who edited the mode within a short window could leave the machine delivering the high-intensity beam. The case illustrates how software-only safety without hardware interlocks leads to catastrophe.
In a Class C infusion pump architecture, the safety check is offloaded to a separate MCU. Why?
Industrial: PLC, IEC 61131-3 and deterministic cycles
Industrial automation (factories, petrochemicals, power plants) is built on **PLCs** (Programmable Logic Controllers) - specialized hardware with one fixed cycle: scan input -> execute logic -> write output. Cycle duration is typically 1-50 ms and is guaranteed by hardware. Programming uses the **IEC 61131-3** languages, including Ladder Diagram (LD), Function Block Diagram (FBD), and Structured Text (ST). Reactivity here does not mean interrupts - everything is polled every cycle. Safety-critical processes (SIS - Safety Instrumented Systems) are governed by **IEC 61508** with SIL 1-4 levels analogous to automotive ASIL.
The PLC paradigm is fundamentally different from an event-driven RTOS: instead of reacting to interrupts, the system polls synchronously each scan cycle. That simplifies analysis: no race conditions, no priorities, execution as deterministic as combinational logic. The price is that worst-case reaction time cannot be shorter than one full cycle, and approaches two cycles when an input changes just after it was scanned. For slow processes (a chemical reactor with a time constant of minutes) this is ideal; motion control requires sub-millisecond cycles, and there PLCs give way to EtherCAT servo drives with specialized ASICs.
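The scan cycle can be sketched as a loop that snapshots the inputs, runs the logic on that snapshot, and only then publishes outputs. The start/stop motor latch and all names below are hypothetical. Real PLC programs would be written in an IEC 61131-3 language, so this only models the execution order.

```python
def scan_cycle(read_inputs, logic, state):
    """One PLC-style cycle: snapshot inputs, execute logic, return outputs."""
    inputs = dict(read_inputs())            # 1. freeze the inputs for this cycle
    outputs, state = logic(inputs, state)   # 2. logic only sees the snapshot
    return outputs, state                   # 3. caller writes the outputs


def motor_logic(inputs, state):
    # Hypothetical start/stop latch; stop wins over start.
    running = (inputs["start"] or state["running"]) and not inputs["stop"]
    return {"motor": running}, {"running": running}


def run(input_trace):
    """Feed one input sample per scan cycle and record the motor output."""
    state = {"running": False}
    history = []
    for sample in input_trace:
        outputs, state = scan_cycle(lambda: sample, motor_logic, state)
        history.append(outputs["motor"])
    return history


trace = [
    {"start": True, "stop": False},
    {"start": False, "stop": False},
    {"start": False, "stop": True},
]
print(run(trace))
```

Because the logic never sees an input change in the middle of a cycle, there is no interleaving to reason about. The same snapshot step is also why a change that arrives just after the scan waits for the next cycle.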
Why does the PLC paradigm avoid race conditions that are typical for RTOS applications?
Comparison and shared patterns
Four industries built outwardly different real-time ecosystems, yet behind the standards there are three shared patterns. **Temporal isolation**: a guaranteed time slot for each function regardless of others - ARINC partitions, AUTOSAR static schedules, PLC scan cycles. **Spatial isolation**: one function's memory is protected from another - MPU, separate MCUs, ARINC memory partitions. **Independence of failure**: a critical function keeps operating even when a routine one has failed - safety MCUs, lockstep CPUs, hardware interlocks. The industries differ in cost and in the standards through which these patterns are expressed.
One common quantitative measure is the **failure rate target** for each criticality level. ASIL D (automotive): 10^-8 failures per hour. DAL A (avionics): 10^-9 failures per hour. SIL 4 (industrial): 10^-9 to 10^-8. Class C medical: typically 10^-7 to 10^-8. These numbers bound the tolerated rate of dangerous failures and force either a duplicated architecture (dual redundancy) or a triplicated one (TMR - Triple Modular Redundancy with voting) to reach the required MTBF.
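A sketch of why triplication helps, under two loud assumptions: channel failures are independent, and the voter itself never fails. Neither holds exactly in practice, where common-cause failures dominate. The function names are hypothetical.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority voter: return the value at least two channels agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among the three channels")
    return value


def tmr_failure_probability(p):
    """Probability that at least 2 of 3 channels fail, each independently
    with probability p (assumption: independent failures, perfect voter)."""
    return 3 * p**2 * (1 - p) + p**3


print(tmr_vote(42, 42, 17))
print(tmr_failure_probability(1e-4))
```

For small p the result is close to 3p^2, so a channel that fails with probability 10^-4 over some interval yields a voted system near 3 x 10^-8 over the same interval. That is the kind of gain that makes the targets above reachable.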
A real-time system is just 'a fast computer'
Real-time means a guaranteed upper bound on latency, proven in advance. Average or peak speed does not matter - worst-case predictability does.
A PLC with a 10 ms scan cycle is real-time for a chemical process with a minutes-long time constant. The fastest supercomputer with unpredictable GC pauses is not real-time for quadcopter control. The goal of real-time design is not maximum performance but a provable bound on response time.
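The distinction between average and worst case can be made concrete with a few latency samples. The numbers and the 10 ms deadline below are invented for illustration, with one sample standing in for a GC-like pause. A measured maximum is also only evidence: a real-time argument needs a proven bound, not an observed one.

```python
from statistics import mean

# Hypothetical response times in ms; the 45.0 stands in for one long pause.
samples_ms = [1.0, 1.2, 0.9, 1.1, 45.0, 1.0]
DEADLINE_MS = 10.0

average_ms = mean(samples_ms)
worst_ms = max(samples_ms)

meets_on_average = average_ms <= DEADLINE_MS   # looks fine on a dashboard
meets_worst_case = worst_ms <= DEADLINE_MS     # what real-time actually asks

print(f"average {average_ms:.2f} ms, worst {worst_ms} ms")
print(meets_on_average, meets_worst_case)
```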
If ASIL D + DAL A + SIL 4 were to be combined in one certifiable component, what practical difficulty arises?
Key ideas
- **Automotive**: AUTOSAR Classic + ISO 26262 ASIL D, static task schedule on OSEK, lockstep CPU for critical functions; millions of units per year drive cost focus
- **Avionics**: ARINC 653 partitioning + DO-178C DAL A, formal verification, temporal isolation between partition windows, $200-500 per DAL A line of code
- **Medical**: IEC 62304 Class C, defence in depth - separate safety MCU + hardware interlocks + watchdog, Therac-25 history as a lesson against software-only safety
- **Industrial**: PLC + IEC 61131-3 + SIL 1-4, the scan-execute-write cycle is deterministic and single-threaded - races are eliminated constructively
- **Shared patterns**: temporal isolation, spatial isolation, independence of failure; the differences in standards are formal, the ideas are universal
Related topics
Real-time system design sits at the crossroads of several directions beyond any single industry:
- Schedulability analysis — Rate Monotonic and EDF provide formal methods to estimate worst-case timing and apply across all four industries
- Hardware design — Lockstep CPUs, watchdog timers, MPU - hardware mechanisms without which no certification of critical functions is feasible
- Formal methods — Frama-C, SPARK, TLA+ are used in avionics and in parts of automotive for ASIL D / DAL A functions
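As a small illustration of the schedulability analysis mentioned above, the sketch below applies the classic utilization tests: the Liu and Layland bound n(2^(1/n) - 1) for Rate Monotonic and U <= 1 for EDF. It assumes independent periodic tasks on one preemptive core with deadlines equal to periods. The task sets are hypothetical (worst-case execution time, period) pairs.

```python
def utilization(tasks):
    """Total CPU utilization of (wcet, period) pairs."""
    return sum(wcet / period for wcet, period in tasks)


def rm_bound(n):
    """Liu-Layland utilization bound for n tasks under Rate Monotonic."""
    return n * (2 ** (1 / n) - 1)


def rm_schedulable_sufficient(tasks):
    """Sufficient, not necessary: False means inconclusive, not unschedulable."""
    return utilization(tasks) <= rm_bound(len(tasks))


def edf_schedulable(tasks):
    """Exact test for EDF under the stated assumptions."""
    return utilization(tasks) <= 1.0


light = [(1, 4), (2, 8), (1, 10)]   # U = 0.60
heavy = [(1, 2), (1, 4), (1, 5)]    # U = 0.95

for name, tasks in [("light", light), ("heavy", heavy)]:
    print(name, rm_schedulable_sufficient(tasks), edf_schedulable(tasks))
```

The heavy set fails the Rate Monotonic bound but passes the EDF test. For Rate Monotonic that result is only inconclusive, and an exact response-time analysis would be the next step.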
Questions for reflection
- If ISO 26262, DO-178C, IEC 62304, and IEC 61508 share common patterns, why is there still no single international real-time safety standard applicable across all industries?
- PLC scan cycles eliminate race conditions but cap latency at the cycle itself. In which scenarios is an event-driven RTOS architecture objectively more effective, and in which does it lose to a PLC?
- DAL A code costs $200-500 per line. If an autonomous car needs ~25 million lines of equivalent quality, what does that imply for the economics of mass autonomy?
Related lessons
- rts-12 — Fault tolerance is the foundation of RT system design
- rts-14 — System design opens the path to RT protocols
- net-15-tcp-basics — TCP/UDP and determinism in RT networks
- ds-01-intro — Distributed systems and RT share reliability challenges
- db-03-acid — ACID guarantees and RT determinism: analogous reliability requirements
- emb-01