ROS2 Performance with HDDS
Guide to optimizing ROS2 application performance using rmw_hdds.
Performance Benchmarks
HDDS Internal Benchmarks
Measured with benches/ suite (11 benchmark files: demux_latency, discovery_latency,
reliable_qos, rtps, runtime, stress, telemetry, latency_benchmark).
| Metric | Value | Transport | Notes |
|---|---|---|---|
| Write latency | 257 ns | UDP | Single write call, no network RTT |
| Intra-process latency | 280 ns | IntraProcess | Publisher to subscriber, same process |
| Throughput | 4.48 M msg/s | UDP | Small payloads, best-effort |
| SHM latency | < 1 us | Shared Memory | Futex-based ring buffer design |
SHM Transport Raw (Release Build)
Direct measurement of the HDDS shared-memory ring buffer, no RMW overhead:
| Operation | Latency |
|---|---|
| Push (writer) | 14.6 ns |
| Pop (reader) | 337.6 ns |
| End-to-end (push + pop) | 974.6 ns |
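The sub-microsecond push/pop numbers above come from a lock-free single-producer/single-consumer ring. A minimal Python sketch of the same pattern (illustrative only; the real HDDS ring is futex-based Rust with atomics, and this class is hypothetical):

```python
class SpscRing:
    """Single-producer/single-consumer ring buffer sketch.

    Illustrates the push/pop split measured above: the writer only
    touches head, the reader only touches tail, so neither blocks
    the other. One slot is sacrificed to distinguish full from empty.
    """

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # next slot to write (producer-owned)
        self.tail = 0  # next slot to read (consumer-owned)

    def push(self, item):
        nxt = (self.head + 1) % self.capacity
        if nxt == self.tail:
            return False  # full; writer drops instead of blocking
        self.buf[self.head] = item
        self.head = nxt
        return True

    def pop(self):
        if self.tail == self.head:
            return None  # empty
        item = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        return item
```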
RMW SHM Roundtrip (publish_writer + try_shm_take)
Via ForeignRmwContext, includes mutex locks on shm_writers/shm_readers_by_topic:
| Build | Latency |
|---|---|
| Debug | 7,105 ns (7.1 µs) |
| Release | 3,337 ns (3.3 µs) |
Cross-RMW Benchmark — Apex.AI performance_test (Official)
Methodology: Apex.AI performance_test, Docker containers, 1 KB messages, 1000 Hz,
identical hardware, ROS 2 Humble. Results are reproducible and comparable.
| RMW | Mean Latency | Max Latency | Sent/s | Recv/s | vs rmw_hdds |
|---|---|---|---|---|---|
| rmw_hdds | 18.4 µs | 62.9 µs | 999 | 999 | 1x |
| rmw_cyclonedds_cpp | 104.8 µs | 161.1 µs | 999 | 999 | 5.7x slower |
| rmw_fastrtps_cpp | 137.8 µs | 165.3 µs | 999 | 999 | 7.5x slower |
rmw_hdds is 5.7x faster than CycloneDDS and 7.5x faster than FastRTPS. Zero message loss across all implementations.
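The slowdown column follows directly from the mean latencies; a quick sanity check of the table's arithmetic:

```python
# Mean latencies (µs) from the Apex.AI table above
hdds = 18.4
cyclone = 104.8
fastrtps = 137.8

# Slowdown factors relative to rmw_hdds
print(round(cyclone / hdds, 1))    # 5.7
print(round(fastrtps / hdds, 1))   # 7.5
```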
# Same conditions: Docker, ROS 2 Humble, Apex.AI performance_test
docker run --rm ros:humble bash -c "
apt-get update && apt-get install -y ros-humble-performance-test &&
source /opt/ros/humble/setup.bash &&
ros2 run performance_test perf_test \
--communication ROS2 \
--msg Array1k \
--rate 1000 \
--max-runtime 60 \
--rmw rmw_hdds
"
Replace --rmw rmw_hdds with rmw_cyclonedds_cpp or rmw_fastrtps_cpp to reproduce other results.
UDP Transport Benchmark (v233 Self-Loopback Fix)
After fixing double-delivery on UDP self-loopback (v233), HDDS with the full UDP transport stack active (discovery threads, multicast sockets, RTPS router) was benchmarked. Same-process delivery still uses the TopicMerger path (intra-process routing), but all transport infrastructure runs in the background.
| Mode | Mean | Max | Notes |
|---|---|---|---|
| Intra-only (HDDS_TRANSPORT=intra) | 28.7 µs | 76.9 µs | No UDP threads, zero transport overhead |
| UDP + self-loopback fix | 82.6 µs | 130.6 µs | Full transport stack, production-ready |
| First fix + 117 MB diagnostic spam | 112.4 µs | — | Not representative (eprintln overhead) |
Reference (same conditions, single-process):
| RMW | Mean | Notes |
|---|---|---|
| rmw_hdds (UDP stack) | 82.6 µs | Full transport active |
| rmw_cyclonedds_cpp | ~110 µs | Their full intra-process path |
| rmw_fastrtps_cpp | ~133 µs | Their full intra-process path |
rmw_hdds with full UDP transport is still 1.33x faster than CycloneDDS in same-process mode.
HDDS auto-routes same-process pub/sub through the TopicMerger (intra-process path: direct memory copy, no CDR serialization, no sockets), while CycloneDDS serializes and sends over UDP even for same-process communication. The 18.4 µs headline (Apex.AI Docker benchmark) captures this optimization; it is a genuine routing feature, not a benchmarking artifact.
The 82.6 µs figure measures HDDS with the full transport stack active: UDP threads are running, but delivery still takes the intra-process path due to the self-loopback fix. This is the worst-case same-process number.
Full Transport Tier Comparison
| Transport | Latency | Mode | Status |
|---|---|---|---|
| SHM raw | 975 ns | Same process, bare ring buffer | Internal microbenchmark |
| RMW SHM | 3.3 µs | Same process, full rmw path | Internal microbenchmark |
| Intra-only (HDDS_TRANSPORT=intra) | 28.7 µs | Single process, no UDP threads | Internal benchmark |
| rmw_hdds (auto) | 18.4 µs | Docker, single-process, intra routing | Official Apex.AI benchmark |
| rmw_hdds (UDP stack active) | 82.6 µs | Single-process, full transport | Internal benchmark (v233) |
| rmw_cyclonedds_cpp | 104.8 µs | Docker, single-process | Official Apex.AI benchmark |
| rmw_fastrtps_cpp | 137.8 µs | Docker, single-process | Official Apex.AI benchmark |
The 18.4 µs figure is the end-to-end production number when HDDS auto-routes same-process traffic through its intra-process path. The 82.6 µs is measured with the full UDP transport stack running in background — still faster than CycloneDDS in equivalent conditions.
Industry Comparison (Small Payloads, ~64 bytes)
Sources are third-party benchmarks from published reports and documentation. Hardware and methodology differ between sources, so these numbers are not directly comparable, but they give an order of magnitude.
| Implementation | Transport | Latency | Source |
|---|---|---|---|
| HDDS | Intra-process | ~280 ns | Internal benchmarks (benches/) |
| iceoryx2 | SHM zero-copy | < 100 ns | iceoryx2 v0.4.0 release |
| Zenoh | P2P | ~10 us | Zenoh benchmark blog |
| CycloneDDS | UDP multicast | ~8 us | Zenoh benchmark blog (64B, single machine) |
| CycloneDDS | UDP | ~17 us | CycloneDDS docs (i7 4.2GHz, roundtrip/2) |
| RTI Connext Micro | UDP best-effort | ~25 us | RTI Micro benchmarks (64B, Xeon, roundtrip/2) |
| RTI Connext Micro | UDP reliable | ~28 us | Same source |
| FastDDS | UDP sync | ~30-50 us | eProsima performance (estimated from graphs) |
RTI Connext DDS Micro — Detailed Numbers (64B, Xeon x86_64, 1 Gbps)
From RTI official documentation:
| Mode | p50 | p99 | p99.99 |
|---|---|---|---|
| Best-effort unkeyed | 25 us | 26 us | 28 us |
| Reliable unkeyed | 26 us | 32 us | 34 us |
Methodology Notes
- HDDS "257 ns write latency" measures the write call duration, not network round-trip
- HDDS "280 ns intra-process" is same-process delivery (no serialization, no network)
- RTI/CycloneDDS numbers are round-trip / 2 over UDP between two processes
- iceoryx2 numbers are SHM-only, no network capability
- A proper apples-to-apples comparison requires running the same benchmark tool (e.g., Apex.AI performance_test) on the same hardware with all RMW implementations
Feature Comparison
| Feature | rmw_hdds | rmw_fastrtps | rmw_cyclonedds | rmw_iceoryx2 |
|---|---|---|---|---|
| Pub/Sub | Yes | Yes | Yes | Yes |
| Services/Clients | Yes | Yes | Yes | Alpha |
| Graph introspection | Yes | Yes | Yes | Partial |
| QoS (22 policies) | Yes | Yes | Yes | Partial |
| Content Filter | Yes | Yes | Yes | No |
| DDS Security v1.1 | Core only* | Yes | Yes | No |
| UDP Multicast | Yes | Yes | Yes | No (SHM only) |
| Shared Memory | Experimental | Yes | Yes (via iceoryx) | Yes |
| TCP/TLS | Yes | Yes | No | No |
| QUIC | Yes | No | No | No |
| Embedded (no_std) | Yes | No | No | No |
| Radio (LoRa/nRF24) | Yes | No | No | No |
| Core language | Rust | C++ | C | Rust |
| License | Apache-2.0/MIT | Apache-2.0 | EPL-2.0 | Apache-2.0 |
*Security is implemented in HDDS core but not yet wired through the rmw layer.
Interop Verification
HDDS has been tested for bidirectional interoperability with:
| Vendor | Version | Samples | Protocol |
|---|---|---|---|
| RTI Connext | 6.1.0 / 7.3.0 | 50/50 | RTPS 2.3 / 2.5 |
| eProsima FastDDS | 3.1.x | 50/50 | RTPS 2.3 |
| Eclipse CycloneDDS | 0.10.x | 50/50 | RTPS 2.3 |
| OpenDDS | - | 50/50 | RTPS 2.3 |
Quick Optimization
Enable Shared Memory
# Already enabled by default in HDDS
# Verify it's working:
export HDDS_LOG_LEVEL=info
ros2 run my_package my_node 2>&1 | grep -i "shared memory"
Use Best Effort for Sensors
from rclpy.qos import QoSProfile, ReliabilityPolicy, HistoryPolicy
sensor_qos = QoSProfile(
    reliability=ReliabilityPolicy.BEST_EFFORT,
    history=HistoryPolicy.KEEP_LAST,
    depth=1
)
self.create_subscription(Imu, '/imu', self.imu_callback, sensor_qos)
Increase Buffer Sizes
# System-level tuning
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
QoS Optimization
Sensor Data (High Rate)
from rclpy.qos import QoSProfile, ReliabilityPolicy, HistoryPolicy, DurabilityPolicy
# Optimal for IMU, LIDAR, camera at high rates
sensor_qos = QoSProfile(
    reliability=ReliabilityPolicy.BEST_EFFORT,
    durability=DurabilityPolicy.VOLATILE,
    history=HistoryPolicy.KEEP_LAST,
    depth=1
)
Commands (Must Arrive)
# Optimal for cmd_vel, control commands
command_qos = QoSProfile(
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.VOLATILE,
    history=HistoryPolicy.KEEP_LAST,
    depth=10
)
State (Late Joiners Need)
# Optimal for robot_state, map data
state_qos = QoSProfile(
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.TRANSIENT_LOCAL,
    history=HistoryPolicy.KEEP_LAST,
    depth=1
)
Services
# ROS 2 service QoS is fixed by the client library
# HDDS automatically optimizes service communication
Transport Optimization
Same-Host Communication
<!-- hdds_config.xml -->
<hdds>
  <transport>
    <shared_memory enabled="true" prefer="true">
      <segment_size_mb>256</segment_size_mb>
    </shared_memory>
  </transport>
</hdds>
Network Communication
<hdds>
  <transport>
    <udp enabled="true">
      <send_buffer_size>16777216</send_buffer_size>
      <receive_buffer_size>16777216</receive_buffer_size>
    </udp>
    <shared_memory enabled="false"/>
  </transport>
</hdds>
Multi-Robot Fleet
<hdds>
  <transport>
    <udp>
      <multicast enabled="false"/>
    </udp>
  </transport>
  <discovery>
    <static_peers>
      <peer>${BASE_STATION}:7400</peer>
    </static_peers>
  </discovery>
</hdds>
Node Optimization
Callback Executor
import rclpy
from rclpy.executors import MultiThreadedExecutor
def main():
    rclpy.init()
    node1 = MyNode1()
    node2 = MyNode2()
    # Multi-threaded for parallel callbacks
    executor = MultiThreadedExecutor(num_threads=4)
    executor.add_node(node1)
    executor.add_node(node2)
    try:
        executor.spin()
    finally:
        rclpy.shutdown()
Timer Optimization
# Use wall timer for control loops
self.create_wall_timer(0.01, self.control_callback) # 100 Hz
# Avoid creating many small timers
# Instead, use single timer with state machine
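The single-timer-with-state-machine idea can be sketched independently of rclpy as a dispatch table; the timer wiring shown in the comment is the only rclpy-specific part (state names and transitions here are made up for illustration):

```python
class ControlStateMachine:
    """One periodic tick drives all states, replacing many small timers."""

    def __init__(self):
        self.state = 'idle'
        # Map each state to the handler that returns the next state
        self.handlers = {
            'idle': self._idle,
            'moving': self._moving,
        }

    def tick(self):
        # Driven by a single wall timer, e.g.:
        #   self.create_wall_timer(0.01, sm.tick)  # 100 Hz
        self.state = self.handlers[self.state]()

    def _idle(self):
        return 'moving'   # e.g. a goal arrived

    def _moving(self):
        return 'idle'     # e.g. goal reached
```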
Subscription Callback
def fast_callback(self, msg):
    # Do minimal work in callback
    # Offload heavy processing to separate thread
    self.process_queue.put(msg)

def slow_processing(self):
    while True:
        msg = self.process_queue.get()
        self.heavy_computation(msg)
Message Optimization
Use Fixed-Size Messages
# Prefer fixed-size arrays
from std_msgs.msg import Float32MultiArray
# vs variable-length sequences
# Better: Create custom message with fixed arrays
# my_msgs/msg/FixedSensorData.msg
# float32[16] values
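A small helper can coerce variable-length readings into the fixed float32[16] layout sketched above (the 16-slot size mirrors the hypothetical FixedSensorData message; the helper itself is illustrative):

```python
def to_fixed16(values, size=16, fill=0.0):
    """Pad or truncate a reading to a fixed-size array.

    Fixed-size fields let the middleware preallocate buffers and
    skip per-message length handling.
    """
    out = list(values[:size])
    out += [fill] * (size - len(out))
    return out
```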
Avoid Large Messages
# Split large data into chunks
class ImageChunker:
    def __init__(self, node, chunk_size=65536):
        self.pub = node.create_publisher(Chunk, 'image_chunks', 10)
        self.chunk_size = chunk_size

    def publish_image(self, image_data):
        for i in range(0, len(image_data), self.chunk_size):
            chunk = Chunk()
            chunk.sequence = i // self.chunk_size
            chunk.data = image_data[i:i + self.chunk_size]
            self.pub.publish(chunk)
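A matching receiver-side sketch might look like this (the Chunk message above carries only sequence and data, so the total chunk count is assumed to arrive out of band, e.g. in a header message):

```python
class ImageReassembler:
    """Rebuilds the byte stream published by a chunking publisher.

    Tolerates out-of-order arrival by keying chunks on sequence
    number; total_chunks must be known from some other channel.
    """

    def __init__(self, total_chunks):
        self.total_chunks = total_chunks
        self.chunks = {}

    def add(self, sequence, data):
        self.chunks[sequence] = bytes(data)
        return len(self.chunks) == self.total_chunks  # complete?

    def assemble(self):
        if len(self.chunks) != self.total_chunks:
            return None  # still missing chunks
        return b''.join(self.chunks[i] for i in range(self.total_chunks))
```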
Zero-Copy Transfer
// C++ only: Use loaned messages
auto msg = pub_->borrow_loaned_message();
msg.get().data = sensor_value;
pub_->publish(std::move(msg));
Launch File Optimization
CPU Affinity
from launch import LaunchDescription
from launch_ros.actions import Node
def generate_launch_description():
    return LaunchDescription([
        Node(
            package='my_package',
            executable='critical_node',
            # Pin to specific CPU cores
            prefix='taskset -c 0,1',
            parameters=[{'use_intra_process_comms': True}]
        ),
    ])
Intra-Process Communication
Node(
    package='image_proc',
    executable='debayer_node',
    parameters=[{'use_intra_process_comms': True}],
    # Nodes in same process share memory directly
)
Composable Nodes
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode
container = ComposableNodeContainer(
    name='sensor_container',
    namespace='',
    package='rclcpp_components',
    executable='component_container_mt',
    composable_node_descriptions=[
        ComposableNode(
            package='sensor_driver',
            plugin='SensorDriver',
        ),
        ComposableNode(
            package='sensor_filter',
            plugin='SensorFilter',
        ),
    ],
)
System Tuning
Linux Kernel Parameters
# /etc/sysctl.d/90-ros2-hdds.conf
# Network buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 4194304
net.core.wmem_default = 4194304
net.core.netdev_max_backlog = 100000
# Shared memory
kernel.shmmax = 268435456
kernel.shmall = 65536
# Apply
sudo sysctl -p /etc/sysctl.d/90-ros2-hdds.conf
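To confirm the settings took effect, the current values can be read back from /proc (Linux only; the helper returns None elsewhere):

```python
import os

def read_sysctl(name):
    """Return a sysctl value as an int, or None if unavailable."""
    path = '/proc/sys/' + name.replace('.', '/')
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())

# e.g. read_sysctl('net.core.rmem_max') should report 16777216
# after the config above has been applied
```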
Real-Time Priority
# /etc/security/limits.d/ros2.conf
@ros2 - rtprio 99
@ros2 - nice -20
@ros2 - memlock unlimited
# Add user to ros2 group
sudo groupadd ros2
sudo usermod -aG ros2 $USER
# In launch file
prefix='chrt -f 50'
CPU Governor
# Set to performance mode
sudo cpupower frequency-set -g performance
# Or per-core
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $cpu
done
Profiling
ROS2 Tracing
# Install
sudo apt install ros-$ROS_DISTRO-tracetools-launch
# Trace
ros2 launch tracetools_launch example.launch.py
# Analyze
babeltrace /path/to/trace | grep -E "callback|publish"
HDDS Statistics
# Enable statistics
export HDDS_STATS_ENABLE=1
# Run node
ros2 run my_package my_node
# View stats
ros2 topic echo /hdds/statistics
Latency Measurement
from rclpy.node import Node
from rclpy.time import Time

class LatencyNode(Node):
    def __init__(self):
        super().__init__('latency_node')
        self.pub = self.create_publisher(Stamped, 'ping', 10)
        self.sub = self.create_subscription(Stamped, 'pong', self.pong_cb, 10)
        self.latencies = []

    def ping(self):
        msg = Stamped()
        msg.header.stamp = self.get_clock().now().to_msg()
        self.pub.publish(msg)

    def pong_cb(self, msg):
        now = self.get_clock().now()
        sent = Time.from_msg(msg.header.stamp)
        latency = (now - sent).nanoseconds / 1e6  # ms
        self.latencies.append(latency)
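The collected latencies can then be summarized with the same statistics the vendor tables report; a small mean/p99 helper (pure Python, no ROS dependency):

```python
import statistics

def summarize(latencies_ms):
    """Return (mean, p99) in ms from a list of latency samples."""
    s = sorted(latencies_ms)
    # Nearest-rank p99: index at 99% of the sorted samples
    p99 = s[min(len(s) - 1, int(len(s) * 0.99))]
    return statistics.mean(s), p99
```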
Performance Checklist
Configuration
- Enable shared memory for same-host nodes
- Use appropriate QoS for each topic type
- Configure adequate buffer sizes
- Tune discovery for deployment topology
Code
- Use composable nodes where possible
- Enable intra-process communication
- Avoid work in callbacks (queue to separate thread)
- Use fixed-size message types
System
- Set kernel parameters for networking/memory
- Configure CPU governor to performance
- Set real-time priorities for critical nodes
- Pin CPU affinity for determinism
Deployment
- Disable logging in production
- Use release builds
- Profile before/after optimization
- Monitor resource usage
Common Performance Issues
| Issue | Symptom | Solution |
|---|---|---|
| High latency | Delayed messages | Enable SHM, reduce history |
| Dropped messages | Missing data | Increase history depth |
| High CPU | Spinning | Use WaitSet, fix spin rate |
| Memory growth | OOM | Limit history, check leaks |
| Slow discovery | Late matching | Configure initial peers |
Next Steps
- rmw_hdds Configuration - Detailed settings
- Latency Tuning - Advanced latency optimization
- Throughput Tuning - Maximize bandwidth