
ROS2 Performance with HDDS

Guide to optimizing ROS2 application performance using rmw_hdds.

Performance Benchmarks

HDDS Internal Benchmarks

Measured with the benches/ suite (11 benchmark files, including demux_latency, discovery_latency, reliable_qos, rtps, runtime, stress, telemetry, and latency_benchmark).

| Metric | Value | Transport | Notes |
|---|---|---|---|
| Write latency | 257 ns | UDP | Single write call, no network RTT |
| Intra-process latency | 280 ns | IntraProcess | Publisher to subscriber, same process |
| Throughput | 4.48 M msg/s | UDP | Small payloads, best-effort |
| SHM latency | < 1 µs | Shared Memory | Futex-based ring buffer design |

SHM Transport Raw (Release Build)

Direct measurement of the HDDS shared-memory ring buffer, no RMW overhead:

| Operation | Latency |
|---|---|
| Push (writer) | 14.6 ns |
| Pop (reader) | 337.6 ns |
| End-to-end (push + pop) | 974.6 ns |

RMW SHM Roundtrip (publish_writer + try_shm_take)

Via ForeignRmwContext, includes mutex locks on shm_writers/shm_readers_by_topic:

| Build | Latency |
|---|---|
| Debug | 7,105 ns (7.1 µs) |
| Release | 3,337 ns (3.3 µs) |

Cross-RMW Benchmark — Apex.AI performance_test (Official)

Methodology: Apex.AI performance_test, Docker containers, 1 KB messages, 1000 Hz, identical hardware, ROS 2 Humble. Results are reproducible and comparable.

| RMW | Mean Latency | Max Latency | Sent/s | Recv/s | vs rmw_hdds |
|---|---|---|---|---|---|
| rmw_hdds | 18.4 µs | 62.9 µs | 999 | 999 | 1x |
| rmw_cyclonedds_cpp | 104.8 µs | 161.1 µs | 999 | 999 | 5.7x slower |
| rmw_fastrtps_cpp | 137.8 µs | 165.3 µs | 999 | 999 | 7.5x slower |

rmw_hdds is 5.7x faster than CycloneDDS and 7.5x faster than FastRTPS. Zero message loss across all implementations.
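The speedup factors follow directly from the mean latencies in the table; a quick arithmetic check:

```python
# Speedup ratios derived from the mean latencies reported above
hdds_us = 18.4
cyclone_us = 104.8
fastrtps_us = 137.8

print(f"CycloneDDS vs rmw_hdds: {cyclone_us / hdds_us:.1f}x")   # 5.7x
print(f"FastRTPS vs rmw_hdds:   {fastrtps_us / hdds_us:.1f}x")  # 7.5x
```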

Reproduce this benchmark
# Same conditions: Docker, ROS 2 Humble, Apex.AI performance_test
docker run --rm ros:humble bash -c "
  apt-get update &&
  apt-get install -y ros-humble-performance-test &&
  source /opt/ros/humble/setup.bash &&
  ros2 run performance_test perf_test \
    --communication ROS2 \
    --msg Array1k \
    --rate 1000 \
    --max-runtime 60 \
    --rmw rmw_hdds
"

Replace --rmw rmw_hdds with rmw_cyclonedds_cpp or rmw_fastrtps_cpp to reproduce other results.

UDP Transport Benchmark (v233 Self-Loopback Fix)

After the v233 fix for double-delivery on UDP self-loopback, HDDS was benchmarked with the full UDP transport stack active (discovery threads, multicast sockets, RTPS router). Same-process delivery still uses the TopicMerger path (intra-process routing), but all transport infrastructure runs in the background.

| Mode | Mean | Max | Notes |
|---|---|---|---|
| Intra-only (HDDS_TRANSPORT=intra) | 28.7 µs | 76.9 µs | No UDP threads, zero transport overhead |
| UDP + self-loopback fix | 82.6 µs | 130.6 µs | Full transport stack, production-ready |
| First fix + 117 MB diagnostic spam | 112.4 µs | n/a | Not representative (eprintln overhead) |

Reference (same conditions, single-process):

| RMW | Mean | Notes |
|---|---|---|
| rmw_hdds (UDP stack) | 82.6 µs | Full transport active |
| rmw_cyclonedds_cpp | ~110 µs | Their full intra-process path |
| rmw_fastrtps_cpp | ~133 µs | Their full intra-process path |

rmw_hdds with full UDP transport is still 1.33x faster than CycloneDDS in same-process mode.

Why the 18.4 µs vs 82.6 µs difference?

HDDS auto-routes same-process pub/sub through the TopicMerger (intra-process path: direct memory copy, no CDR, no sockets). CycloneDDS always serializes and sends over UDP even for same-process communication. The 18.4 µs headline (Apex.AI Docker benchmark) captures this optimization — it is a real product feature, not a trick.

The 82.6 µs figure measures HDDS with the full transport stack active: UDP threads are running, but delivery still takes the intra path thanks to the self-loopback fix. This is the "worst case" same-process number.
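The two modes above differ only in which transport tier is active. Selecting the mode from the environment can be sketched as follows; HDDS_TRANSPORT=intra appears in the benchmarks above, and treating "unset" as the default full-stack mode is an assumption:

```shell
# Fastest same-process path: no UDP threads at all (~28.7 µs above)
export HDDS_TRANSPORT=intra
# ros2 run my_package my_node

# Default: full transport stack; same-process traffic is still
# auto-routed through the intra-process TopicMerger path (~82.6 µs)
unset HDDS_TRANSPORT
# ros2 run my_package my_node
```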

Full Transport Tier Comparison

| Transport | Latency | Mode | Status |
|---|---|---|---|
| SHM raw | 975 ns | Same process, bare ring buffer | Internal microbenchmark |
| RMW SHM | 3.3 µs | Same process, full rmw path | Internal microbenchmark |
| Intra-only (HDDS_TRANSPORT=intra) | 28.7 µs | Single process, no UDP threads | Internal benchmark |
| rmw_hdds (auto) | 18.4 µs | Docker, single-process, intra routing | Official Apex.AI benchmark |
| rmw_hdds (UDP stack active) | 82.6 µs | Single-process, full transport | Internal benchmark (v233) |
| rmw_cyclonedds_cpp | 104.8 µs | Docker, single-process | Official Apex.AI benchmark |
| rmw_fastrtps_cpp | 137.8 µs | Docker, single-process | Official Apex.AI benchmark |

The 18.4 µs figure is the end-to-end production number when HDDS auto-routes same-process traffic through its intra-process path. The 82.6 µs is measured with the full UDP transport stack running in background — still faster than CycloneDDS in equivalent conditions.

Industry Comparison (Small Payloads, ~64 bytes)

Sources are third-party benchmarks from published reports and documentation. Hardware and methodology differ between sources; the numbers are not directly comparable, but they give an order of magnitude.

| Implementation | Transport | Latency | Source |
|---|---|---|---|
| HDDS | Intra-process | ~280 ns | Internal benchmarks (benches/) |
| iceoryx2 | SHM zero-copy | < 100 ns | iceoryx2 v0.4.0 release |
| Zenoh | P2P | ~10 µs | Zenoh benchmark blog |
| CycloneDDS | UDP multicast | ~8 µs | Zenoh benchmark blog (64B, single machine) |
| CycloneDDS | UDP | ~17 µs | CycloneDDS docs (i7 4.2 GHz, roundtrip/2) |
| RTI Connext Micro | UDP best-effort | ~25 µs | RTI Micro benchmarks (64B, Xeon, roundtrip/2) |
| RTI Connext Micro | UDP reliable | ~28 µs | Same source |
| FastDDS | UDP sync | ~30-50 µs | eProsima performance (estimated from graphs) |

RTI Connext DDS Micro: Detailed Numbers (64B, Xeon x86_64, 1 Gbps)

From RTI official documentation:

| Mode | p50 | p99 | p99.99 |
|---|---|---|---|
| Best-effort unkeyed | 25 µs | 26 µs | 28 µs |
| Reliable unkeyed | 26 µs | 32 µs | 34 µs |

Methodology Notes

Comparing apples to oranges
  • HDDS "257 ns write latency" measures the write call duration, not network round-trip
  • HDDS "280 ns intra-process" is same-process delivery (no serialization, no network)
  • RTI/CycloneDDS numbers are round-trip / 2 over UDP between two processes
  • iceoryx2 numbers are SHM-only, no network capability
  • A proper apples-to-apples comparison requires running the same benchmark tool (e.g., Apex.AI performance_test) on the same hardware with all RMW implementations
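The round-trip/2 convention in the notes above can be made explicit; a trivial helper, assuming a ping-pong measurement in microseconds:

```python
def one_way_us(roundtrip_us: float) -> float:
    """Convert a measured ping-pong round-trip to the one-way latency
    convention used in the RTI and CycloneDDS numbers above."""
    return roundtrip_us / 2.0

# A 50 µs ping-pong corresponds to a reported ~25 µs one-way latency
print(one_way_us(50.0))  # 25.0
```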

Feature Comparison

| Feature | rmw_hdds | rmw_fastrtps | rmw_cyclonedds | rmw_iceoryx2 |
|---|---|---|---|---|
| Pub/Sub | Yes | Yes | Yes | Yes |
| Services/Clients | Yes | Yes | Yes | Alpha |
| Graph introspection | Yes | Yes | Yes | Partial |
| QoS (22 policies) | Yes | Yes | Yes | Partial |
| Content Filter | Yes | Yes | Yes | No |
| DDS Security v1.1 | Core only* | Yes | Yes | No |
| UDP Multicast | Yes | Yes | Yes | No (SHM only) |
| Shared Memory | Experimental | Yes | Yes (via iceoryx) | Yes |
| TCP/TLS | Yes | Yes | No | No |
| QUIC | Yes | No | No | No |
| Embedded (no_std) | Yes | No | No | No |
| Radio (LoRa/nRF24) | Yes | No | No | No |
| Core language | Rust | C++ | C | Rust |
| License | Apache-2.0/MIT | Apache-2.0 | EPL-2.0 | Apache-2.0 |

*Security is implemented in HDDS core but not yet wired through the rmw layer.

Interop Verification

HDDS has been tested for bidirectional interoperability with:

| Vendor | Version | Samples | Protocol |
|---|---|---|---|
| RTI Connext | 6.1.0 / 7.3.0 | 50/50 | RTPS 2.3 / 2.5 |
| eProsima FastDDS | 3.1.x | 50/50 | RTPS 2.3 |
| Eclipse CycloneDDS | 0.10.x | 50/50 | RTPS 2.3 |
| OpenDDS | - | 50/50 | RTPS 2.3 |

Quick Optimization

Enable Shared Memory

# Already enabled by default in HDDS
# Verify it's working:
export HDDS_LOG_LEVEL=info
ros2 run my_package my_node 2>&1 | grep -i "shared memory"

Use Best Effort for Sensors

from rclpy.qos import QoSProfile, ReliabilityPolicy, HistoryPolicy

sensor_qos = QoSProfile(
    reliability=ReliabilityPolicy.BEST_EFFORT,
    history=HistoryPolicy.KEEP_LAST,
    depth=1
)

self.create_subscription(Imu, '/imu', self.imu_callback, sensor_qos)

Increase Buffer Sizes

# System-level tuning
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216

QoS Optimization

Sensor Data (High Rate)

from rclpy.qos import QoSProfile, ReliabilityPolicy, HistoryPolicy, DurabilityPolicy

# Optimal for IMU, LIDAR, camera at high rates
sensor_qos = QoSProfile(
    reliability=ReliabilityPolicy.BEST_EFFORT,
    durability=DurabilityPolicy.VOLATILE,
    history=HistoryPolicy.KEEP_LAST,
    depth=1
)

Commands (Must Arrive)

# Optimal for cmd_vel, control commands
command_qos = QoSProfile(
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.VOLATILE,
    history=HistoryPolicy.KEEP_LAST,
    depth=10
)

State (Late Joiners Need)

# Optimal for robot_state, map data
state_qos = QoSProfile(
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.TRANSIENT_LOCAL,
    history=HistoryPolicy.KEEP_LAST,
    depth=1
)

Services

# ROS2 service QoS (fixed)
# HDDS automatically optimizes service communication

Transport Optimization

Same-Host Communication

<!-- hdds_config.xml -->
<hdds>
  <transport>
    <shared_memory enabled="true" prefer="true">
      <segment_size_mb>256</segment_size_mb>
    </shared_memory>
  </transport>
</hdds>

Network Communication

<hdds>
  <transport>
    <udp enabled="true">
      <send_buffer_size>16777216</send_buffer_size>
      <receive_buffer_size>16777216</receive_buffer_size>
    </udp>
    <shared_memory enabled="false"/>
  </transport>
</hdds>

Multi-Robot Fleet

<hdds>
  <transport>
    <udp>
      <multicast enabled="false"/>
    </udp>
  </transport>
  <discovery>
    <static_peers>
      <peer>${BASE_STATION}:7400</peer>
    </static_peers>
  </discovery>
</hdds>

Node Optimization

Callback Executor

import rclpy
from rclpy.executors import MultiThreadedExecutor

def main():
    rclpy.init()

    node1 = MyNode1()
    node2 = MyNode2()

    # Multi-threaded for parallel callbacks
    executor = MultiThreadedExecutor(num_threads=4)
    executor.add_node(node1)
    executor.add_node(node2)

    executor.spin()
    rclpy.shutdown()

Timer Optimization

# Use wall timer for control loops
self.create_wall_timer(0.01, self.control_callback) # 100 Hz

# Avoid creating many small timers
# Instead, use single timer with state machine
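The single-timer pattern above can be sketched as follows (class and state names are illustrative): one 100 Hz wall timer advances a small state machine instead of several independent timers.

```python
class ControlStateMachine:
    """All periodic work advances from one timer tick."""

    def __init__(self):
        self.state = 'idle'

    def step(self):
        # One transition per tick; replaces N small per-task timers
        if self.state == 'idle':
            self.state = 'sense'
        elif self.state == 'sense':
            self.state = 'act'
        else:
            self.state = 'idle'
        return self.state

# In a node's __init__:
#   self.sm = ControlStateMachine()
#   self.create_wall_timer(0.01, self.sm.step)  # single 100 Hz timer
sm = ControlStateMachine()
print([sm.step() for _ in range(4)])  # ['sense', 'act', 'idle', 'sense']
```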

Subscription Callback

def fast_callback(self, msg):
    # Do minimal work in the callback; just hand off to the worker
    self.process_queue.put(msg)

def slow_processing(self):
    # Worker thread: a queue.Queue created in __init__, with this method
    # running in a daemon threading.Thread started there
    while True:
        msg = self.process_queue.get()
        self.heavy_computation(msg)

Message Optimization

Use Fixed-Size Messages

# Prefer fixed-size arrays
from std_msgs.msg import Float32MultiArray
# vs variable-length sequences

# Better: Create custom message with fixed arrays
# my_msgs/msg/FixedSensorData.msg
# float32[16] values
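The sizing argument can be illustrated without ROS at all, using Python's struct module (not a ROS message, just the layout math):

```python
import struct

# A fixed float32[16] payload always serializes to the same number of
# bytes, so buffers can be preallocated once and reused per message
FIXED_FMT = '<16f'  # little-endian, 16 x float32
print(struct.calcsize(FIXED_FMT))  # 64: known before any message exists

# Packing a message never changes that size
payload = struct.pack(FIXED_FMT, *([0.0] * 16))
print(len(payload))  # 64
```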

Avoid Large Messages

# Split large data into chunks
# (Chunk: illustrative custom message with 'sequence' and 'data' fields)
class ImageChunker:
    def __init__(self, node, chunk_size=65536):
        self.pub = node.create_publisher(Chunk, 'image_chunks', 10)
        self.chunk_size = chunk_size

    def publish_image(self, image_data):
        for i in range(0, len(image_data), self.chunk_size):
            chunk = Chunk()
            chunk.sequence = i // self.chunk_size
            chunk.data = image_data[i:i + self.chunk_size]
            self.pub.publish(chunk)
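The receiving side then needs to reassemble. A pure-Python sketch of the counterpart, assuming the same hypothetical Chunk fields (sequence, data) used above:

```python
class ImageAssembler:
    """Rebuilds an image from chunks published by an ImageChunker-style sender."""

    def __init__(self, total_chunks):
        self.total = total_chunks
        self.parts = {}

    def add(self, sequence, data):
        # Chunks may arrive out of order; index them by sequence number
        self.parts[sequence] = bytes(data)
        return len(self.parts) == self.total  # True once complete

    def image(self):
        # Concatenate in sequence order
        return b''.join(self.parts[i] for i in sorted(self.parts))

asm = ImageAssembler(total_chunks=2)
asm.add(1, b'world')
complete = asm.add(0, b'hello ')
print(complete, asm.image())  # True b'hello world'
```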

Zero-Copy Transfer

// C++ only: Use loaned messages
auto msg = pub_->borrow_loaned_message();
msg.get().data = sensor_value;
pub_->publish(std::move(msg));

Launch File Optimization

CPU Affinity

from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        Node(
            package='my_package',
            executable='critical_node',
            # Pin to specific CPU cores
            prefix='taskset -c 0,1',
            parameters=[{'use_intra_process_comms': True}]
        ),
    ])

Intra-Process Communication

Node(
    package='image_proc',
    executable='debayer_node',
    parameters=[{'use_intra_process_comms': True}],
    # Nodes in same process share memory directly
)

Composable Nodes

from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode

container = ComposableNodeContainer(
    name='sensor_container',
    namespace='',
    package='rclcpp_components',
    executable='component_container_mt',
    composable_node_descriptions=[
        ComposableNode(
            package='sensor_driver',
            plugin='SensorDriver',
        ),
        ComposableNode(
            package='sensor_filter',
            plugin='SensorFilter',
        ),
    ],
)

System Tuning

Linux Kernel Parameters

# /etc/sysctl.d/90-ros2-hdds.conf

# Network buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 4194304
net.core.wmem_default = 4194304
net.core.netdev_max_backlog = 100000

# Shared memory
kernel.shmmax = 268435456
kernel.shmall = 65536

# Apply
sudo sysctl -p /etc/sysctl.d/90-ros2-hdds.conf

Real-Time Priority

# /etc/security/limits.d/ros2.conf
@ros2 - rtprio 99
@ros2 - nice -20
@ros2 - memlock unlimited

# Add user to ros2 group
sudo groupadd ros2
sudo usermod -aG ros2 $USER

# In launch file
prefix='chrt -f 50'

CPU Governor

# Set to performance mode
sudo cpupower frequency-set -g performance

# Or per-core
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance | sudo tee "$cpu"
done

Profiling

ROS2 Tracing

# Install
sudo apt install ros-$ROS_DISTRO-tracetools-launch

# Trace
ros2 launch tracetools_launch example.launch.py

# Analyze
babeltrace /path/to/trace | grep -E "callback|publish"

HDDS Statistics

# Enable statistics
export HDDS_STATS_ENABLE=1

# Run node
ros2 run my_package my_node

# View stats
ros2 topic echo /hdds/statistics

Latency Measurement

from rclpy.node import Node
from rclpy.time import Time
# Stamped: any message type with a std_msgs/Header field

class LatencyNode(Node):
    def __init__(self):
        super().__init__('latency_node')
        self.pub = self.create_publisher(Stamped, 'ping', 10)
        self.sub = self.create_subscription(Stamped, 'pong', self.pong_cb, 10)
        self.latencies = []

    def ping(self):
        msg = Stamped()
        msg.header.stamp = self.get_clock().now().to_msg()
        self.pub.publish(msg)

    def pong_cb(self, msg):
        now = self.get_clock().now()
        sent = Time.from_msg(msg.header.stamp)
        latency = (now - sent).nanoseconds / 1e6  # ms
        self.latencies.append(latency)
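Once self.latencies has samples, they can be summarized in the same p50/p99 form as the benchmark tables above; a small stdlib-only helper:

```python
import statistics

def summarize(latencies_ms):
    """Mean/p50/p99 of a list of latency samples (milliseconds)."""
    ordered = sorted(latencies_ms)
    p99_idx = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        'mean': statistics.fmean(ordered),
        'p50': statistics.median(ordered),
        'p99': ordered[p99_idx],
    }

print(summarize([0.1, 0.2, 0.2, 0.3, 5.0]))
```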

Performance Checklist

Configuration

  • Enable shared memory for same-host nodes
  • Use appropriate QoS for each topic type
  • Configure adequate buffer sizes
  • Tune discovery for deployment topology

Code

  • Use composable nodes where possible
  • Enable intra-process communication
  • Avoid work in callbacks (queue to separate thread)
  • Use fixed-size message types

System

  • Set kernel parameters for networking/memory
  • Configure CPU governor to performance
  • Set real-time priorities for critical nodes
  • Pin CPU affinity for determinism

Deployment

  • Disable logging in production
  • Use release builds
  • Profile before/after optimization
  • Monitor resource usage

Common Performance Issues

| Issue | Symptom | Solution |
|---|---|---|
| High latency | Delayed messages | Enable SHM, reduce history |
| Dropped messages | Missing data | Increase history depth |
| High CPU | Spinning | Use WaitSet, fix spin rate |
| Memory growth | OOM | Limit history, check leaks |
| Slow discovery | Late matching | Configure initial peers |

Next Steps