Real-Time Systems

RT on Edge and Cloud: ROS2, Edge Computing, and Latency SLAs

Amazon warehouses: 750,000 robots by 2024. Each is connected to an edge coordinator in the same building, not to a cloud datacenter in Virginia. RTT to cloud: 50 ms. RTT to edge: 0.5 ms. At 1000 commands per second (one command every millisecond), that latency gap is the difference between an efficient warehouse and collision-filled chaos.

NASA Perseverance: ROS 2 on Mars. Earth-Mars one-way signal delay: 3-22 minutes (RTT roughly 6-44 minutes). All RT logic runs on-device
Boston Dynamics Spot on 5G: planning in near-edge, safety on-device, fleet management in cloud
AWS Wavelength: cloud compute inside 5G base stations, 1-5 ms RTT for edge robots

ROS 2 and DDS: Middleware for Distributed RT

ROS 1 did not die because it was bad. It died because it used a centralised master node: one coordinator process for the entire system. One failure - the whole robot system is paralysed. ROS 2 (2017) replaced this with DDS - Data Distribution Service. Decentralised publish-subscribe with no single point of failure. Waymo, Boston Dynamics, NASA Perseverance - all migrated to ROS 2.

**DDS (Data Distribution Service)**: an OMG standard for real-time middleware. **QoS policies** define communication characteristics: Reliability (RELIABLE vs BEST_EFFORT), Durability (whether the last message is kept for late-joining subscribers), Deadline (maximum inter-message gap), History (how many messages to buffer). QoS profiles are set on each publisher and subscription, so every topic can have its own profile, but the two ends must request compatible policies or they will not be matched. DDS latency: 100-500 us on loopback, 1-2 ms over Ethernet.
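The Deadline policy is easiest to understand as a check on the gap between consecutive samples. Below is a minimal sketch in plain Python. It is not the DDS or ROS 2 API: `TopicQoS` and `deadline_misses` are hypothetical names introduced for illustration, Durability is left out for brevity, and the 10 ms deadline and timestamps are made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicQoS:
    """Hypothetical stand-in for a QoS profile (not a real DDS/ROS 2 class)."""
    reliable: bool      # Reliability: True = RELIABLE, False = BEST_EFFORT
    keep_last: int      # History: how many samples to buffer
    deadline_s: float   # Deadline: maximum allowed gap between samples


def deadline_misses(arrival_times_s, qos):
    """Return (previous, current) arrival pairs whose gap exceeds the deadline."""
    misses = []
    for prev, curr in zip(arrival_times_s, arrival_times_s[1:]):
        if curr - prev > qos.deadline_s:
            misses.append((prev, curr))
    return misses


# Illustrative profile for a safety-critical topic: reliable, 10 ms deadline.
estop_qos = TopicQoS(reliable=True, keep_last=1, deadline_s=0.010)

# Arrival timestamps in seconds; the third sample arrives 25 ms after the second.
arrivals = [0.000, 0.008, 0.033, 0.040]
missed = deadline_misses(arrivals, estop_qos)
print(missed)  # [(0.008, 0.033)]
```

In a real DDS stack the middleware performs this check itself and notifies the application when the deadline is missed; the sketch only shows what is being measured.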

**Fast DDS vs CycloneDDS**: the two main DDS implementations for ROS 2. Fast DDS (eProsima): supports Discovery Server for large deployments, shared memory transport (zero-copy on one host). CycloneDDS: better latency on LAN, Eclipse Foundation. For RT the choice of transport is critical: UDP multicast for discovery, UDP unicast for data, shared memory for inter-process on one host (~10x faster).

Why does ROS 2 use DDS instead of the centralised master node from ROS 1?

ROS 1 master: one process stores the registry of all topics and services. Its failure = entire system paralysed. DDS: peer-to-peer discovery without a centre. Any node can discover others directly. This is mandatory for safety-critical systems requiring fault tolerance.

Edge Computing for RT: Latency SLAs and Cloud Offload

Cloud RT is an oxymoron. Round-trip time to AWS us-east-1: 15-30 ms from the US East Coast, 80-150 ms from Europe, 200+ ms from Asia. Hard RT deadline for robot control: 10 ms. The math does not work: the network round trip alone exceeds the deadline before any computation starts. The solution: **edge computing** - computation at network nodes close to sensors and actuators, with the cloud reserved for non-RT tasks only.

**RT compute hierarchy**: (1) **On-device RT** (<1 ms): motor control, safety monitor - on an MCU next to actuators; (2) **Edge RT** (1-20 ms): perception, planning - on an embedded GPU/SoC; (3) **Near-edge** (20-100 ms): ML inference, map updates - on an edge server in the local network; (4) **Cloud non-RT** (>100 ms): fleet management, model training, global maps. Each level has its own SLA and its own hardware.
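The hierarchy above can be read as a lookup from a task's deadline to the most remote tier that may host it. A minimal sketch follows; the function name `tier_for_deadline` is hypothetical, and treating each boundary value (1, 20, 100 ms) as belonging to the tighter tier is an assumption, since the ranges in the text touch at their edges.

```python
import bisect

# Upper deadline bound (ms) of the first three tiers, from the hierarchy above.
TIER_BOUNDS_MS = [1.0, 20.0, 100.0]
TIER_NAMES = ["on-device", "edge", "near-edge", "cloud"]


def tier_for_deadline(deadline_ms):
    """Map a task deadline in milliseconds to a compute tier.

    Assumption: a deadline exactly on a boundary goes to the tighter tier.
    """
    if deadline_ms <= 0:
        raise ValueError("deadline must be positive")
    return TIER_NAMES[bisect.bisect_left(TIER_BOUNDS_MS, deadline_ms)]


print(tier_for_deadline(1.0))    # on-device  (motor control)
print(tier_for_deadline(10.0))   # edge       (perception, planning)
print(tier_for_deadline(50.0))   # near-edge  (ML inference, map updates)
print(tier_for_deadline(500.0))  # cloud      (fleet management)
```

A real placement decision would also account for compute time and worst-case network jitter, not the deadline alone.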

**AWS Wavelength and Azure Edge Zones**: cloud services inside 5G base stations. Latency: 1-5 ms to a mobile device vs 50-100 ms to a regional datacenter. For a robot on 5G: perception in AWS Wavelength (1-5 ms network RTT, before processing time) + safety on-device (<1 ms). Boston Dynamics Spot: 5G edge deployment at industrial facilities - task planning at the near edge, execution locally.

Why can motor control not be offloaded to the cloud even with 5G connectivity?

Motor control requires a hard deadline of 1 ms. Even 5G with AWS Wavelength delivers an RTT of 1-5 ms, so the network alone consumes or exceeds the whole budget. Jitter (variability in delay) makes it worse: once the control loop crosses a network, meeting the deadline becomes probabilistic rather than deterministic. Hard RT requires local computation with no network dependency.
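The effect of network delay on a control deadline can be made concrete by counting missed cycles. A minimal sketch, assuming a 0.2 ms compute time and a handful of made-up RTT samples in the 1-5 ms range; `miss_ratio` is a hypothetical helper, not part of any library.

```python
def miss_ratio(rtt_samples_ms, compute_ms, deadline_ms):
    """Fraction of cycles whose network RTT plus compute time exceeds the deadline."""
    if not rtt_samples_ms:
        raise ValueError("need at least one RTT sample")
    misses = sum(1 for rtt in rtt_samples_ms if rtt + compute_ms > deadline_ms)
    return misses / len(rtt_samples_ms)


# Hypothetical RTT measurements (ms) for a controller reached over 5G.
remote_rtts = [1.1, 1.4, 2.0, 4.7, 1.2, 3.3, 1.0, 1.6]
# A local loop has no network hop, so its RTT contribution is zero.
local_rtts = [0.0] * len(remote_rtts)

print(miss_ratio(remote_rtts, compute_ms=0.2, deadline_ms=1.0))  # 1.0: every cycle misses
print(miss_ratio(remote_rtts, compute_ms=0.2, deadline_ms=2.0))  # 0.375: misses depend on jitter
print(miss_ratio(local_rtts, compute_ms=0.2, deadline_ms=1.0))   # 0.0: no cycle misses
```

With a 2 ms deadline the remote loop meets most cycles but not all, which is exactly the probabilistic behaviour that hard RT cannot tolerate.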

Latency SLA: Design and Monitoring

A **latency SLA** is a Service Level Agreement for the temporal characteristics of an RT system. It is stated not as 'average latency' but as a percentile deadline such as **p99**: 99% of requests must complete within N ms. For hard RT the bound is p100, the worst case; for soft RT, p99.9 is typical. The difference is fundamental: average 5 ms with p99=100 ms is a disaster for RT. That is precisely why Google and Amazon publish p50/p95/p99/p99.9, not averages.
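The gap between the mean and the tail is easy to reproduce. The sketch below uses the nearest-rank percentile definition (one of several in common use) on synthetic data: 98 cycles at 4 ms and 2 stalls at 100 ms. The data and the helper name `percentile` are illustrative assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies (ms): mostly fast, with two long stalls.
latencies_ms = [4.0] * 98 + [100.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean_ms:.2f} ms")                     # mean = 5.92 ms
print(f"p50  = {percentile(latencies_ms, 50)} ms")    # p50  = 4.0 ms
print(f"p99  = {percentile(latencies_ms, 99)} ms")    # p99  = 100.0 ms
print(f"max  = {max(latencies_ms)} ms")               # max  = 100.0 ms
```

A mean of about 6 ms looks comfortable against a 20 ms SLA, yet two cycles in every hundred take 100 ms, and only the p99 and maximum reveal it.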

**ROS 2 rosbag2** and **Foxglove**: tools for post-mortem analysis of RT systems. They record all messages with timestamps and allow replaying incidents. When a deadline is violated, rosbag analysis shows which component missed its slot. NVIDIA Nsight Systems: profiler for GPU-accelerated RT tasks on Jetson Orin - shows CPU/GPU execution timeline with 1 us resolution.

Misconception: edge computing is slower than the cloud, because cloud providers have powerful hardware and an optimised network stack.

In fact, edge computing is faster for RT: cloud RTT is 15-200 ms vs edge <1 ms. The cloud is faster for batch ML training, where throughput matters, not latency.

The speed of light: 300,000 km/s. London to New York: ~5,500 km = ~18 ms one way, or ~37 ms RTT, from physics alone, and more in optical fibre, where light travels at roughly two-thirds of that speed. Plus routing, load balancing, serialisation. Edge compute: co-located with sensors, <1 ms RTT. For hard RT only edge or on-device compute can meet the requirements.
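The propagation floor can be checked with a few lines of arithmetic. A minimal sketch: `propagation_delay_ms` is a hypothetical helper, the 5,500 km distance is the rounded figure from the text, and the fibre velocity factor of 2/3 is an approximation. Note that 5,500 km at the vacuum speed of light takes about 18 ms in one direction, so the round trip is roughly double that.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in a vacuum


def propagation_delay_ms(distance_km, velocity_factor=1.0, round_trip=True):
    """Lower bound on signal delay from propagation alone, in milliseconds.

    velocity_factor is the signal speed as a fraction of c
    (about 2/3 in optical fibre; an approximation).
    """
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity_factor must be in (0, 1]")
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor)
    trips = 2 if round_trip else 1
    return one_way_s * trips * 1000.0


london_new_york_km = 5_500  # rounded great-circle distance from the text

one_way = propagation_delay_ms(london_new_york_km, round_trip=False)
rtt_vacuum = propagation_delay_ms(london_new_york_km)
rtt_fibre = propagation_delay_ms(london_new_york_km, velocity_factor=2 / 3)

print(f"one way, vacuum: {one_way:.1f} ms")    # one way, vacuum: 18.3 ms
print(f"RTT, vacuum:     {rtt_vacuum:.1f} ms")  # RTT, vacuum:     36.7 ms
print(f"RTT, fibre:      {rtt_fibre:.1f} ms")   # RTT, fibre:      55.0 ms
```

These numbers are lower bounds; real paths add routing detours, queuing, and processing on top.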

An RT system shows average latency 5 ms against an SLA of 20 ms. Why does this not mean the SLA is met?

RT SLA: one deadline violation can cause an accident regardless of average latency. Heavy tails arise from OS scheduling jitter, cache misses, and GC pauses. Monitoring must track p99, p99.9, and maximum, not just the mean.

Key Ideas

ROS 2 DDS: decentralised publish-subscribe without a master; QoS policies per topic
RT hierarchy: on-device (<1 ms) -> edge (1-20 ms) -> near-edge (20-100 ms) -> cloud (>100 ms)
Cloud RT is impossible: physics of RTT is incompatible with hard RT deadlines below 10 ms
Latency SLA: track p99/p99.9, not average - one outlier can be catastrophic
DDS QoS: Deadline policy + Reliability=RELIABLE for safety-critical topics

Questions for Reflection

Perseverance runs ROS 2 with an Earth-Mars RTT of 20 minutes. How should autonomy be designed under such a constraint?
5G Standalone promises 1 ms latency. How does this change the RT hierarchy and what remains on-device?
ROS 2 DDS discovery creates network overhead with a large number of nodes. How does the Fast DDS Discovery Server solve this in fleet deployments?

Related Lessons

rts-17 — Safety standards define the latency SLA requirements at the edge
rts-16 — The RT stack of autonomous systems is the base architecture for edge compute
rts-03 — Basic RT concepts extend to distributed edge systems
rts-06 — Schedulability analysis is applied to prove deadline feasibility in edge RT nodes