# Performance Benchmarks
All numbers on this page are measured and reproducible. Each section links to the script or tool used so you can verify on your own hardware.
## Intra-Process Latency

Roundtrip write + read through the lock-free SlabPool/IndexRing path, with no
network involved. Measured with `examples/latency_benchmark` (10,000 messages,
1,000 warmup, BestEffort KeepLast(1), 16-byte payload).
Reproduce:

```bash
# Single run
cargo run --example latency_benchmark --release

# Statistical run (100 iterations)
./scripts/bench-intra-process.sh 100
```
### Results (100 runs per machine)
| Machine | CPU | p50 median | p50 best | Min ever | Sub-257ns runs |
|---|---|---|---|---|---|
| node2 | i9-9900KF @ 4.8GHz (turbo) | 247 ns | 243 ns | 242 ns | 53/100 |
| node4 | i7-6700K @ 4.0GHz (perf gov.) | 310 ns | 306 ns | 280 ns | 0/100 |
| devdeb | Xeon E5-2699 v4 @ 2.2GHz (VM) | ~50 us | - | 526 ns | 0/100 |
Date: 2026-03-04, HDDS v1.0.8, Linux 6.12, --release (no LTO).
The 257 ns figure published on hdds.io was originally measured on the first HDDS release (rustc 1.83, release + LTO, i9-9900KF). The current codebase (v1.0.8, with 100+ features) achieves a 247 ns median on the same hardware, confirming the published claim is accurate and even conservative.
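The aggregate columns in the table (p50 median, p50 best, sub-257 ns count) are simple statistics over the 100 per-run p50 values; a minimal sketch of that aggregation, with assumed names (not the actual script):

```rust
/// Aggregate per-run p50 latencies (ns) into the statistics reported
/// above: (median of p50s, best p50, number of sub-257 ns runs).
fn summarize(mut p50s: Vec<u64>) -> (u64, u64, usize) {
    p50s.sort_unstable();
    let median = p50s[p50s.len() / 2]; // "p50 median" column
    let best = p50s[0];                // "p50 best" column
    let sub_257 = p50s.iter().filter(|&&ns| ns < 257).count();
    (median, best, sub_257)
}

fn main() {
    // Five illustrative per-run p50s, not real measurements:
    let (median, best, sub_257) = summarize(vec![247, 243, 251, 249, 255]);
    println!("p50 median={median} ns, best={best} ns, sub-257 ns runs={sub_257}");
}
```

The "Min ever" column is the minimum over individual samples, so it cannot be derived from per-run p50s alone.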
## `Writer.write()` Microbenchmark (Criterion)

Measured with `cargo bench` (Criterion, 64-byte Temperature struct). This
measures only the write path, not the full roundtrip.

```bash
cargo bench --package hdds -- write_latency
```
| Payload | Latency (release, no LTO) |
|---|---|
| 64 B | 2.49 us |
| 256 B | 2.54 us |
| 1 KB | 2.70 us |
Date: 2026-03-04, HDDS v1.0.8, Xeon E5-2699 v4 @ 2.2GHz.
Note: The Criterion bench measures the full writer.write() path including RTPS
framing and UDP send. The intra-process fast path (see above) bypasses these
layers entirely, which is why the example benchmark is much faster.
## UDP Loopback Latency (Same Host)

End-to-end RTT measured with `hdds-latency-probe` (ping/pong, 64-byte payload,
500-1000 samples, Reliable QoS).

```bash
# Terminal 1
hdds-latency-probe pong --domain 0

# Terminal 2
hdds-latency-probe ping --domain 0 --size 64 --count 1000
```
### Results
| Machine | CPU | p50 | p99 |
|---|---|---|---|
| node2 | i9-9900KF @ 3.6GHz | 246 us | 450 us |
| node4 | i7-6700K @ 4.0GHz | ~270 us | ~470 us |
| devdeb | Dual Xeon E5-2699 v4 | 246 us | 450 us |
Date: 2026-01-08, HDDS v0.8.0.
### Best-Effort vs Reliable (i9-9900KF, loopback)
| QoS | Min | Avg | p99 |
|---|---|---|---|
| Reliable | 971 us | 1,288 us | 2,181 us |
| Best-Effort | 968 us | 1,230 us | 2,192 us |
Minimal difference on loopback (no packet loss to trigger retransmission).
## Optimization History
| Version | Change | 64B p50 | vs baseline |
|---|---|---|---|
| v208 | 1ms polling | 1,134 us | baseline |
| v209 | 100us polling | 413 us | 2.7x faster |
| v210 | WakeNotifier (Condvar) | 354 us | 3.2x faster |
| v211 | Atomic fast-path + Spin | 280 us | 4.0x faster |
| v212 | mio/epoll listener | 246 us | 4.6x faster |
## Comparison with CycloneDDS
Measured on the same machine, same test conditions, UDP loopback.
| Size | CycloneDDS p50 | HDDS p50 | Gap |
|---|---|---|---|
| 64B | 133 us | 246 us | 1.8x |
| 1KB | 120 us | ~270 us | 2.2x |
| 8KB | 150 us | ~290 us | 1.9x |
Date: 2026-01-08, CycloneDDS 0.10.x vs HDDS v0.8.0 (v212), Dual Xeon.
Both implementations achieved 0% packet loss across all tests.
The remaining gap is mainly due to an extra memcpy in the receive path and thread context switches (the listener-to-router thread handoff).
## Component Benchmarks

### Serialization (CDR2)

From `cargo bench --package hdds -- serialize`:
| Operation | 64 B | 256 B | 1 KB | 4 KB |
|---|---|---|---|---|
| Serialize | 50 ns | 120 ns | 400 ns | 1.5 us |
| Deserialize | 40 ns | 100 ns | 350 ns | 1.3 us |
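For context, each row implies a serialization throughput of roughly size / latency; a quick sketch of that arithmetic using the table values (this is not an HDDS API, just a back-of-the-envelope check):

```rust
/// Throughput in GB/s implied by a (payload bytes, latency ns) pair.
/// bytes / ns is numerically equal to GB/s (1e9 B over 1e9 ns).
fn throughput_gbps(bytes: u64, ns: u64) -> f64 {
    bytes as f64 / ns as f64
}

fn main() {
    // (payload bytes, serialize latency ns) taken from the table above
    for (bytes, ns) in [(64u64, 50u64), (256, 120), (1024, 400), (4096, 1500)] {
        println!("{bytes} B / {ns} ns -> {:.2} GB/s", throughput_gbps(bytes, ns));
    }
}
```

Throughput rises with payload size (about 1.3 GB/s at 64 B up to roughly 2.7 GB/s at 4 KB), as fixed per-call overhead amortizes.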
## Memory Usage

### Per-Entity Overhead
| Entity | Memory |
|---|---|
| DomainParticipant | ~2 MB |
| Publisher | ~64 KB |
| Subscriber | ~64 KB |
| DataWriter | ~128 KB + history |
| DataReader | ~128 KB + history |
| Topic | ~16 KB |
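These per-entity figures can be combined into a rough baseline estimate for an application; a sketch under assumed entity counts (the app layout is hypothetical, history caches are excluded, and the costs are the table's approximations):

```rust
/// Rough baseline memory (KB) for one participant, one publisher,
/// one subscriber, plus the given writers/readers/topics, using the
/// approximate per-entity costs from the table above.
fn baseline_kb(writers: u64, readers: u64, topics: u64) -> u64 {
    let (participant, pubsub, endpoint, topic) = (2048, 64, 128, 16);
    participant + 2 * pubsub + (writers + readers) * endpoint + topics * topic
}

fn main() {
    // Hypothetical app: 4 writer/reader pairs on 4 topics.
    let kb = baseline_kb(4, 4, 4);
    println!("~{:.1} MB baseline", kb as f64 / 1024.0);
}
```

The participant dominates; each additional writer/reader pair adds only about 256 KB plus its history.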
### History Cache Memory

```text
Memory = instances x history_depth x sample_size + overhead
```

Example (100 sensors, 10-deep history, 1 KB samples):

```text
= 100 x 10 x 1024 + 100 x 256
= 1.02 MB + 25 KB = ~1.05 MB
```
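The worked example maps directly onto a small helper; a sketch with assumed names (the 256 B per-instance overhead comes from the example above and is build-dependent):

```rust
/// History-cache memory estimate (bytes), matching the formula above.
/// `overhead` is the assumed per-instance bookkeeping cost
/// (256 B in the worked example; the real value depends on the build).
fn history_cache_bytes(instances: usize, depth: usize, sample: usize, overhead: usize) -> usize {
    instances * depth * sample + instances * overhead
}

fn main() {
    // 100 sensors, 10-deep history, 1 KB samples:
    let bytes = history_cache_bytes(100, 10, 1024, 256);
    println!("{bytes} B (~{:.2} MB)", bytes as f64 / 1e6);
}
```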
## Running Benchmarks

### Intra-Process (fast path)

```bash
# Single run
cargo run --example latency_benchmark --release

# 100 runs with statistics
./scripts/bench-intra-process.sh 100

# Skip rebuild
./scripts/bench-intra-process.sh 100 --no-build
```
Tips for stable results:

- Copy the binary to `/tmp` (the script does this automatically)
- Set the CPU governor to `performance`: `sudo cpupower frequency-set -g performance`
- Close other workloads
### Criterion Microbenchmarks

```bash
cargo bench --package hdds                   # All benchmarks
cargo bench --package hdds -- write_latency  # Write path only
cargo bench --package hdds -- serialize      # CDR2 only
```
### UDP Latency (hdds-latency-probe)

```bash
# Terminal 1 - Responder
hdds-latency-probe pong --domain 0

# Terminal 2 - Benchmark
hdds-latency-probe ping --domain 0 --size 64 --count 1000

# JSON output
hdds-latency-probe ping --domain 0 --json
```
## Profiling

### CPU Profiling

```bash
# Using perf
perf record -g cargo run --release --example latency_benchmark
perf report

# Using flamegraph
cargo flamegraph --example latency_benchmark
```
### Memory Profiling

```bash
heaptrack cargo run --release --example latency_benchmark
heaptrack_gui heaptrack.latency_benchmark.*.gz
```