
Validator & Node Optimization - Infrastructure Engineering

Learning Objectives

Specify optimal hardware configurations for validators and stock nodes at different scales

Tune operating systems (Linux) for XRPL workloads

Configure rippled for maximum performance and reliability

Design monitoring and alerting for proactive issue detection

Calculate total cost of ownership for different deployment options

Every optimization discussed in previous lessons depends on solid infrastructure. A poorly configured server wastes CPU cycles on unnecessary work, hits I/O bottlenecks prematurely, and fails under load that well-tuned hardware handles easily.

This lesson provides the complete infrastructure playbook—from hardware specs to kernel parameters—for running production XRPL infrastructure.


Hardware Tiers

Tier 1: Development / Testing

Purpose: Local development, testing
Capacity: <50 TPS

CPU: 4+ cores, 2.5 GHz+
Intel i5/Ryzen 5 or equivalent

Memory: 16 GB DDR4

Storage: 500 GB SSD (SATA acceptable)
Consumer NVMe preferred

Network: 100 Mbps

Estimated Cost: $500-1,000 (used/refurbished)

Use cases:

  • Development environments
  • Testnet participation
  • Learning/experimentation

Tier 2: Production Stock Node

Purpose: Serving application traffic, API access
Capacity: 100-500 TPS

CPU: 8+ cores, 3.0 GHz+
Intel Xeon E-series / AMD EPYC

Memory: 64 GB DDR4 ECC

Storage: 2 TB NVMe SSD
Enterprise grade (1+ DWPD)
Samsung PM1733 or equivalent

Network: 1 Gbps dedicated
Low latency to XRPL network

Estimated Cost: $3,000-6,000

Use cases:

  • Production applications
  • API service providers
  • Exchange integrations

Tier 3: Production Validator

Purpose: Consensus participation
Capacity: 500-1,500 TPS

CPU: 16+ cores, 3.5 GHz+
Intel Xeon Gold / AMD EPYC 7003
High single-thread performance important

Memory: 128 GB DDR4 ECC

Storage: 4 TB NVMe SSD
Enterprise grade (3+ DWPD)
RAID 1 for redundancy

Network: 1-10 Gbps dedicated
Multiple ISP redundancy recommended
Low latency global connectivity

Estimated Cost: $10,000-25,000

Use cases:

  • Mainnet validators
  • High-availability deployments
  • Institutional infrastructure

Tier 4: Enterprise / High-Performance

Purpose: Maximum throughput, full history
Capacity: 1,500+ TPS with headroom

CPU: 32+ cores, 4.0 GHz+
Intel Xeon Platinum / AMD EPYC 7003+
Maximum single-thread performance

Memory: 256 GB - 1 TB DDR4 ECC
Enable huge pages

Storage: 10+ TB NVMe SSD
RAID 10 configuration
Enterprise datacenter grade (10+ DWPD)

Network: 10-25 Gbps dedicated
Global anycast capability
DDoS protection

Estimated Cost: $50,000-100,000+

Use cases:

  • Infrastructure providers
  • Enterprise deployments
  • Research/testing at scale
CPU Selection

Priorities (in order):

  1. Single-thread performance (signature verification)
  2. Core count (parallel processing)
  3. Cache size (working set fits in L3)

Recommended:

  • AMD EPYC 7003/9004 series (best value)
  • Intel Xeon Gold 6300+ series
  • For maximum single-thread: Intel Xeon W-3300

Avoid:

  • ARM processors (limited rippled optimization)
  • Low-clock server chips (many slow cores)
  • Consumer desktop chips (no ECC, limited lifespan)

Memory Selection

Requirements:

  • ECC (Error Correcting Code) - mandatory for validators
  • Registered/buffered for large capacity
  • Speed: 3200 MHz+ DDR4

Sizing (typical working set):

  • Active state cache: ~10-20 GB
  • Transaction processing: ~20-40 GB
  • OS and overhead: ~10 GB
  • Headroom: 2× the above
  • Minimum production: 64 GB
  • Recommended: 128 GB

Channel configuration:

  • Populate all channels for maximum bandwidth
  • Matched DIMMs for dual/quad channel

Storage Selection

Performance targets:

  • Sequential write: 3+ GB/s
  • Random write IOPS: 200K+
  • Endurance: 1+ DWPD (Drive Writes Per Day)
  • Power-loss protection: required for validators

Recommended drives:

  • Samsung PM1733/PM1735
  • Intel P5510/P5800X
  • Micron 9400 series
  • Kioxia CM6/CM7 series

Avoid:

  • Consumer NVMe (QLC, low endurance)
  • SATA SSDs (too slow for high throughput)
  • HDDs (completely unsuitable)

RAID options:

  • RAID 1: basic redundancy
  • RAID 10: best performance + redundancy
  • Hardware RAID controller with BBU/flash cache
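The memory sizing bullets above can be rolled into a quick provisioning estimate. The sketch below uses the upper estimates from this lesson with the 2× headroom rule (illustrative figures only):

```shell
#!/bin/bash
# mem-sizing.sh - rough memory budget from the upper estimates above (GB).
state_cache=20      # active state cache, upper estimate
tx_processing=40    # transaction processing, upper estimate
os_overhead=10      # OS and overhead
working_set=$(( state_cache + tx_processing + os_overhead ))
provision=$(( working_set * 2 ))   # 2x headroom
echo "working set: ${working_set} GB, provision: ${provision} GB"
```

The 140 GB result lands between the 128 GB recommendation and the Tier 4 floor; in practice you round to the nearest standard DIMM configuration (128 or 256 GB).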

Network Selection

Bandwidth:

Traffic analysis at different loads:

TPS    | Inbound    | Outbound   | Total
-------|------------|------------|--------
20     | 50 Kbps    | 200 Kbps   | 250 Kbps
100    | 250 Kbps   | 1 Mbps     | 1.25 Mbps
500    | 1.25 Mbps  | 5 Mbps     | 6.25 Mbps
1,500  | 3.75 Mbps  | 15 Mbps    | 18.75 Mbps

Provisioning guidance:

- Minimum: 10× expected peak traffic
- Stock node: 100 Mbps - 1 Gbps
- Validator: 1 Gbps - 10 Gbps
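The table's figures imply roughly 2.5 Kbps inbound and 10 Kbps outbound per TPS, so combined with the 10× provisioning rule, link sizing reduces to simple arithmetic (a sketch using those implied per-TPS rates):

```shell
#!/bin/bash
# bandwidth-estimate.sh - link sizing from expected peak TPS, using the
# per-TPS rates implied by the table above (~2.5 Kbps in, ~10 Kbps out).
peak_tps=500
in_kbps=$(( peak_tps * 25 / 10 ))     # inbound, Kbps
out_kbps=$(( peak_tps * 10 ))         # outbound, Kbps
total_kbps=$(( in_kbps + out_kbps ))
provision_kbps=$(( total_kbps * 10 )) # 10x peak rule of thumb
echo "peak: ${total_kbps} Kbps, provision at least: ${provision_kbps} Kbps"
```

For 500 TPS this gives a 6.25 Mbps peak and ~62.5 Mbps provisioned, consistent with the 100 Mbps - 1 Gbps stock-node guidance.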

Latency Requirements:

Target latencies to major XRPL hubs:

Location       | Target RTT | Impact
---------------|------------|------------------
US East        | <20ms      | Fastest consensus
US West        | <50ms      | Good
Europe         | <100ms     | Acceptable
Asia-Pacific   | <150ms     | Workable
Global average | <200ms     | Should be below

Effects of higher latency:

- Delayed proposal receipt
- Possible transaction omission from ledgers
- Not disqualifying, but suboptimal
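As a rough triage aid, the RTT bands from the table can be encoded directly. This is a sketch; the thresholds are guidelines from this lesson, not protocol limits:

```shell
#!/bin/bash
# classify_rtt MS - map a measured round-trip time (ms) to the
# impact buckets from the latency table above (guideline values).
classify_rtt() {
    local ms=$1
    if   [ "$ms" -lt 20 ];  then echo "fastest consensus"
    elif [ "$ms" -lt 50 ];  then echo "good"
    elif [ "$ms" -lt 100 ]; then echo "acceptable"
    elif [ "$ms" -lt 150 ]; then echo "workable"
    else                         echo "suboptimal"
    fi
}

# Example: feed it an RTT measured with ping to an XRPL hub
classify_rtt 35
```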

---

Operating System Tuning

Network Tuning (/etc/sysctl.conf):

```
# Increase network buffer sizes
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 31457280
net.core.wmem_default = 31457280

# TCP buffer sizes
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

# Connection handling
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535

# TCP keepalive (for WebSocket connections)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 10

# Enable TCP Fast Open
net.ipv4.tcp_fastopen = 3

# Disable slow start after idle
net.ipv4.tcp_slow_start_after_idle = 0
```

Memory Tuning:

```
# Virtual memory
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
vm.vfs_cache_pressure = 50

# Huge pages (calculate based on memory)
# For a 128 GB system reserving 64 GB for huge pages:
vm.nr_hugepages = 32768

# Memory overcommit (note: overcommit_ratio only takes effect
# when overcommit_memory = 2; with mode 1 it is ignored)
vm.overcommit_memory = 1
vm.overcommit_ratio = 80
```
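The nr_hugepages value follows from the default 2 MiB huge page size on x86-64; a quick sanity check of the figure above:

```shell
#!/bin/bash
# hugepages-calc.sh - pages needed for a given reservation, assuming
# the default 2 MiB huge page size on x86-64.
reserve_gb=64
page_mib=2
nr_hugepages=$(( reserve_gb * 1024 / page_mib ))
echo "vm.nr_hugepages = $nr_hugepages"
```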

File System Tuning:

```
# Increase file descriptor limits
fs.file-max = 2097152
fs.nr_open = 2097152

# Inotify limits (for monitoring)
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
```

Process Limits (/etc/security/limits.conf):

```
# For the rippled user
rippled soft nofile 1000000
rippled hard nofile 1000000
rippled soft nproc 65535
rippled hard nproc 65535
rippled soft memlock unlimited
rippled hard memlock unlimited

# Core dumps for debugging
rippled soft core unlimited
rippled hard core unlimited
```

I/O Scheduler:

```
# For NVMe drives, use 'none' (no scheduler)
echo "none" > /sys/block/nvme0n1/queue/scheduler

# For SATA SSDs, use 'mq-deadline'
echo "mq-deadline" > /sys/block/sda/queue/scheduler

# Increase queue depth
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

# Disable NCQ for problematic drives (rarely needed)
echo 1 > /sys/block/sda/device/queue_depth
```
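The echo commands above do not persist across reboots. One common way to reapply them at boot is a udev rule; the path and match patterns below are a typical convention, not the only one:

```
# /etc/udev/rules.d/60-ioscheduler.rules
# NVMe: no scheduler
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
# Non-rotational SATA (SSDs): mq-deadline
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
```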


Mount Options:

```
# Recommended mount options for XRPL data (ext4)
UUID=xxxx /var/lib/rippled ext4 defaults,noatime,nodiratime,discard,barrier=0 0 2

# For XFS (alternative)
UUID=xxxx /var/lib/rippled xfs defaults,noatime,nodiratime,discard,allocsize=64k 0 2
```

Notes:

- noatime/nodiratime: don't update access times
- discard: enable TRIM for SSDs
- barrier=0: disable write barriers only if using BBU-backed RAID (careful!)
- allocsize: XFS allocation size for large files



rippled Configuration (rippled.cfg)

Server Section:

```
[server]
port_rpc_admin_local
port_peer
port_ws_admin_local
port_ws_public

[port_peer]
ip=0.0.0.0
port=51235
protocol=peer

[port_ws_public]
ip=0.0.0.0
port=6006
protocol=wss

[port_rpc_admin_local]
ip=127.0.0.1
port=5005
protocol=http
admin=127.0.0.1

[port_ws_admin_local]
ip=127.0.0.1
port=6007
protocol=ws
admin=127.0.0.1
```

Performance Section:

```
[node_size]
huge

# Ledger history (memory for ledger cache)
[ledger_history]
full

# Or, for limited history:
# [ledger_history]
# 256

[fetch_depth]
full

# Primary node database
[node_db]
type=NuDB
path=/var/lib/rippled/db/nudb
online_delete=256
advisory_delete=0

# Or, for RocksDB:
# [node_db]
# type=RocksDB
# path=/var/lib/rippled/db/rocksdb
# compression=lz4
# online_delete=256

# Transaction database
[transaction_db]
type=SQLite
path=/var/lib/rippled/db/transaction.db

# Temporary database
[temp_db]
type=RocksDB
path=/var/lib/rippled/db/tempdb
```
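online_delete is a ledger count, so the retention it buys depends on ledger close frequency. A rough conversion, assuming ~4 seconds per ledger close (actual closes typically run 3-5 s):

```shell
#!/bin/bash
# retention-calc.sh - approximate online_delete value for a target
# retention window, assuming ~4 s per ledger (varies in practice).
retention_hours=6
seconds_per_ledger=4
online_delete=$(( retention_hours * 3600 / seconds_per_ledger ))
echo "online_delete=$online_delete"
```

By the same arithmetic, the online_delete=256 above retains only ~17 minutes of ledgers locally, which is fine when history is served elsewhere.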

Network Section:

```
[peers_max]
50

[peer_private]
0

# Fixed peers (reliable, well-connected nodes)
[ips_fixed]
s1.ripple.com 51235
s2.ripple.com 51235

# Cluster configuration for multiple servers you operate:
# [cluster_nodes]
# nHUhG...nodepubkey1
# nHUhG...nodepubkey2

[sntp_servers]
time.google.com
time.cloudflare.com
time.apple.com
```

Validator Configuration (if validating):

```
[validator_token]
eyJ2YWxpZGF0aW9uX...your_token_here

[validators_file]
validators.txt

# For an inline UNL (usually kept in the external file instead):
# [validators]
# nHUXe...validator1
# nHBta...validator2
```

Cache Tuning:

```
# Ledger cache size (adjust based on available RAM)
# Larger = more ledgers in memory = faster access
[ledger_history]
full

# For memory-constrained systems:
# [ledger_history]
# 256

# State cache: auto - rippled calculates based on available RAM

[fetch_depth]
full

# SQLite cache (negative value = size in KiB)
[sqlite]
cache_size=-2097152  # 2 GB cache
```
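The negative cache_size follows SQLite's convention that negative values specify the cache in KiB rather than pages; the figure above works out as:

```shell
#!/bin/bash
# SQLite cache_size: negative values mean "this many KiB".
cache_gb=2
cache_kib=$(( cache_gb * 1024 * 1024 ))
echo "cache_size=-$cache_kib"
```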

Pre-flight Check Script:

```
#!/bin/bash
# pre-flight-check.sh - Verify the system is optimized for rippled

echo "=== XRPL Node Pre-flight Check ==="

# Check CPU
echo -n "CPU cores: "
nproc
echo -n "CPU frequency: "
lscpu | grep "MHz" | head -1

# Check memory
echo -n "Total RAM: "
free -h | grep Mem | awk '{print $2}'

# Check storage (rotational=0 means solid state)
echo -n "Storage type: "
if [ -e /sys/block/nvme0n1 ]; then
    echo "NVMe"
elif [ "$(cat /sys/block/sda/queue/rotational 2>/dev/null)" = "0" ]; then
    echo "SSD"
else
    echo "Check manually"
fi

# Check file limits
echo -n "File descriptor limit: "
ulimit -n

# Check kernel parameters
echo "=== Kernel Parameters ==="
sysctl net.core.rmem_max
sysctl vm.swappiness
sysctl fs.file-max

# Check disk scheduler
echo -n "Disk scheduler: "
cat /sys/block/nvme0n1/queue/scheduler 2>/dev/null || echo "N/A"

echo "=== Check Complete ==="
```


Monitoring and Alerting

Node Health:

```
metrics:
  - name: rippled_server_state
    description: Current server state
    alert_if: != "full"
  - name: rippled_complete_ledgers
  - name: rippled_peer_count
  - name: rippled_uptime
```

Performance:

```
metrics:
  - name: rippled_ledger_close_time
    description: Time to close ledger
    warning_if: > 5000  # ms
    alert_if: > 10000
  - name: rippled_transaction_queue
  - name: rippled_fetch_duration
```

Resource Utilization:

```
metrics:
  - name: cpu_utilization
    warning_if: > 70%
    alert_if: > 90%
  - name: memory_utilization
  - name: disk_io_utilization
  - name: disk_space
  - name: network_bandwidth
```
Prometheus + Grafana Setup:

```
# prometheus.yml
scrape_configs:
  - job_name: 'rippled'
    static_configs:
      - targets: ['localhost:5005']
    metrics_path: '/metrics'
    scheme: 'http'
  - job_name: 'node_exporter'
```
Grafana Dashboard Panels:

Dashboard: XRPL Node Health
  • Server State (stat)
  • Peer Count (gauge)
  • Uptime (stat)
  • Ledger Range (stat)
  • Ledger Close Time (time series)
  • Transaction Queue Depth (time series)
  • Transactions per Second (time series)
  • CPU Usage (time series)
  • Memory Usage (time series)
  • Disk I/O (time series)
  • Network Traffic (time series)
  • Active Alerts Table
  • Alert History
```
# alerting_rules.yml
groups:
  - name: rippled
    rules:
      - alert: RippledDown
        expr: up{job="rippled"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "rippled is down"
      - alert: RippledNotSynced
      - alert: HighLedgerCloseTime
```

Cloud vs. Bare Metal:

Scenario: Production Validator (Tier 3)

Cloud (major provider):
  • Instance: c6i.4xlarge or equivalent
  • Storage: 4 TB gp3 NVMe
  • Network: 5 TB transfer
  • Total monthly: $1,250-1,550
  • Annual: $15,000-18,600

Colocation:
  • Hardware: $15,000 (amortized over 4 years)
  • Colocation: 2U, 10A, 1 Gbps
  • Bandwidth: included or $100-200
  • Total monthly: $700-1,000
  • Annual: $8,400-12,000

On-premise:
  • Hardware: $15,000 (amortized over 4 years)
  • Power: ~300 W × $0.12/kWh
  • Internet: business 1 Gbps
  • Total monthly: $550-850
  • Annual: $6,600-10,200

Rules of thumb:
  • Testing/development: cloud (flexibility)
  • Production <6 months: cloud
  • Production >6 months: bare metal
Scenario: High-Availability Validator (2 nodes)

                 | Cloud    | Colo      | On-Premise
-----------------|----------|-----------|-----------
Year 1           | $36,000  | $38,000*  | $40,000*
Year 2           | $36,000  | $15,000   | $10,000
Year 3           | $36,000  | $15,000   | $10,000
Year 4           | $36,000  | $38,000** | $40,000**
Year 5           | $36,000  | $15,000   | $10,000
5-Year Total     | $180,000 | $121,000  | $110,000
Per-Year Average | $36,000  | $24,200   | $22,000

* Includes initial hardware purchase
** Includes hardware refresh

Observations:
  • Cloud provides fastest deployment, highest flexibility
  • Colo provides best reliability/cost balance
  • On-premise is cheapest but requires expertise
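The 5-year totals in the table reduce to simple sums; the sketch below reproduces them and is easy to adapt with your own quotes:

```shell
#!/bin/bash
# tco-calc.sh - 5-year totals for the three models in the table above.
cloud_total=$(( 36000 * 5 ))
colo_total=$(( 38000 + 15000 + 15000 + 38000 + 15000 ))
onprem_total=$(( 40000 + 10000 + 10000 + 40000 + 10000 ))
echo "cloud: \$$cloud_total  colo: \$$colo_total  on-prem: \$$onprem_total"
```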
Additional operational costs to consider:

Staffing:
- Part-time monitoring: $5,000-10,000/year
- Full-time ops engineer: $100,000-150,000/year
- On-call coverage: $10,000-20,000/year

Services:
- Monitoring (PagerDuty, etc.): $1,000-5,000/year
- Security scanning: $2,000-10,000/year
- Backup services: $1,000-5,000/year

Risk management:
- Cyber insurance: $5,000-20,000/year
- Business continuity planning: variable

Typical totals:
- Small deployment: $10,000-20,000/year
- Medium deployment: $50,000-100,000/year
- Enterprise deployment: $200,000+/year

---

What we know:

OS tuning provides measurable improvement—kernel parameters affect throughput

Monitoring prevents failures—proactive alerting catches issues early

TCO varies significantly by deployment model—cloud vs. bare metal trade-offs are real

What's uncertain:

⚠️ Long-term hardware requirements—depends on network growth

⚠️ Cloud cost trajectory—pricing changes frequently

Common mistakes:

📌 Over-provisioning initially—wasteful if growth doesn't materialize

📌 Ignoring monitoring—problems discovered by users, not ops

📌 Single points of failure—no redundancy = inevitable outage

Infrastructure optimization provides the foundation for everything else. A properly configured server handles 2-3× the load of a default installation. The investment in proper hardware, tuning, and monitoring pays for itself in reliability and performance.


Assignment: Create a complete infrastructure specification for an XRPL deployment.

Requirements:

1. Use case definition

  • Define your use case and capacity requirements
  • Specify availability and latency targets
  • Document compliance/security requirements

2. Hardware specification

  • Complete BOM (Bill of Materials) with specific parts
  • Justify each selection
  • Calculate total hardware cost

3. Configuration

  • OS tuning parameters with explanations
  • Complete rippled.cfg
  • Monitoring configuration

4. Cost analysis

  • 5-year TCO calculation
  • Cloud vs. bare metal comparison
  • Recommendation with justification

Grading criteria:

  • Complete, specific hardware specs (25%)
  • Correct configuration parameters (25%)
  • Thorough monitoring plan (25%)
  • Realistic cost analysis (25%)

Time investment: 3-4 hours


1. Why is ECC memory recommended for validators?

A) ECC is faster
B) ECC prevents bit-flip errors that could cause incorrect consensus
C) ECC uses less power
D) ECC is required by XRPL protocol

Correct Answer: B


2. What's the recommended I/O scheduler for NVMe SSDs running rippled?

A) cfq
B) deadline
C) none (no scheduler)
D) bfq

Correct Answer: C


3. Which node_size setting is appropriate for a production validator?

A) tiny
B) small
C) medium
D) huge

Correct Answer: D


4. At what peer count should alerts trigger?

A) < 100 peers
B) < 50 peers
C) < 10 peers
D) < 5 peers

Correct Answer: C


5. For a 2-year production deployment, which hosting model typically has lowest TCO?

A) Major cloud provider
B) Bare metal colocation
C) Home server
D) They're all the same

Correct Answer: B


Further Reading:

  • rippled documentation on configuration
  • XRPL Foundation validator guides
  • Ripple server requirements
  • Brendan Gregg's Linux performance resources
  • Red Hat Performance Tuning Guide
  • Linux kernel documentation

For Next Lesson:
Lesson 10 covers Production Performance Patterns—real-world case studies and lessons learned.


End of Lesson 9

Total words: ~6,000
Estimated completion time: 55 minutes reading + 3-4 hours for deliverable

Key Takeaways

1. Hardware tiers exist for a reason: Match your hardware to your actual needs—overprovisioning wastes money, underprovisioning causes failures.

2. OS tuning is free performance: Kernel parameters, file limits, and I/O scheduling can improve performance 20-50% with no hardware changes.

3. rippled configuration matters: node_size, cache settings, and peer configuration significantly affect behavior.

4. Monitoring is not optional: You can't optimize what you don't measure. Instrument everything.

5. TCO analysis drives smart decisions: Cloud is fastest to start; bare metal wins long-term for committed deployments.