Troubleshooting and Incident Response | Running an XRPL Validator | XRP Academy
Advanced · 60 min

Troubleshooting and Incident Response

Learning Objectives

Apply systematic troubleshooting methodology to validator issues

Diagnose common validator problems using logs, metrics, and diagnostic commands

Execute incident response procedures under pressure

Conduct post-incident analysis to prevent recurrence

Document incidents for operational learning and improvement

Every validator operator will face incidents. The goal isn't perfection—it's resilience:

Incident Reality:

- Software bugs will affect your server
- Network issues will disrupt connectivity
- Hardware will eventually fail
- Human error will cause problems
- External factors will create challenges

What separates prepared operators:

- Detection time (minutes vs. hours)
- Response quality (systematic vs. panicked)
- Resolution speed (prepared vs. improvised)
- Learning captured (improved vs. repeated)
- Documentation (complete vs. missing)

This lesson prepares you for the inevitable.

---
Troubleshooting Framework:

1. DETECT
2. ASSESS
3. HYPOTHESIZE
4. TEST
5. RESOLVE
6. DOCUMENT
```
# Initial diagnostic commands
# Run these first when investigating issues

# 1. Is rippled running?
systemctl status rippled
pgrep -a rippled

# 2. What's the server state?
/opt/ripple/bin/rippled server_info 2>&1 | head -50

# 3. Recent log entries
sudo tail -100 /var/log/rippled/debug.log

# 4. System resources
free -h
df -h
top -bn1 | head -20

# 5. Network status
ss -tlnp | grep rippled
/opt/ripple/bin/rippled peers | head -30
```

Start Here: What's the symptom?

```
Symptom: Server not responding
├── Is rippled process running?
│   ├── No → Check why it stopped (logs, OOM, crash)
│   └── Yes → Check if RPC responding
│       ├── No → Check admin port binding
│       └── Yes → Proceed to state check

Symptom: Not in "proposing" state
├── What state is it in?
│   ├── "full" → Token not loaded or key issue
│   ├── "syncing" → Synchronization problem
│   ├── "tracking" → Partial sync, connectivity?
│   └── "connected" → Peer/network issues

Symptom: Low peer count
├── Is firewall correct?
│   ├── No → Fix firewall rules
│   └── Yes → Check external reachability
│       ├── Port blocked → Network/ISP issue
│       └── Port open → Check fixed peers

Symptom: High ledger age
├── Check peer count
│   ├── Low → Fix connectivity first
│   └── Normal → Check resource usage
│       ├── High CPU/Memory → Resource issue
│       └── Normal → Network latency?
```
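The "not proposing" branch of the tree is simple enough to encode as a small shell helper that maps each rippled server_state to the likely cause. This is an illustrative sketch; `diagnose_state` is a hypothetical function name, and the states are the standard rippled server states:

```shell
# diagnose_state: likely cause for each rippled server_state,
# following the "Not in proposing state" branch of the tree above.
diagnose_state() {
    case "$1" in
        proposing) echo "Healthy: validator is proposing" ;;
        full)      echo "Token not loaded or key issue" ;;
        syncing)   echo "Synchronization problem" ;;
        tracking)  echo "Partial sync - check connectivity" ;;
        connected) echo "Peer/network issues" ;;
        *)         echo "Unexpected state '$1' - check logs" ;;
    esac
}

# Live usage would feed in the real state, e.g.:
# STATE=$(/opt/ripple/bin/rippled server_info | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)
# diagnose_state "$STATE"
diagnose_state full   # prints: Token not loaded or key issue
```

Keeping the mapping in one function means your runbooks and alerting scripts stay consistent as you refine the tree.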


Problem: rippled Won't Start

Symptoms:

  • Service fails to start
  • Process exits immediately
  • Errors in journal

Diagnostic:

```
# Check service status
sudo systemctl status rippled

# Check journal for errors
sudo journalctl -u rippled --no-pager | tail -50

# Try starting manually for more output
sudo /opt/ripple/bin/rippled --conf /opt/ripple/etc/rippled.cfg --fg
```

Common Causes and Solutions:

  • "Error in configuration file" → check rippled.cfg syntax and [section] stanzas
  • "Address already in use" → check for a leftover rippled process and kill it
  • "Database error" or similar → the database may need to be deleted and resynced
  • OOM killer messages in dmesg → reduce node_size or add memory
  • "Permission denied" errors → check file ownership
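For the "Address already in use" case, something already holds the peer port (51235 by default). The sketch below pulls the offending PID out of an ss-style listing; the SAMPLE line is canned stand-in data for live `ss -tlnp` output:

```shell
# Find the PID already bound to the peer port (51235 by default).
# SAMPLE stands in for one line of live `ss -tlnp` output.
SAMPLE='LISTEN 0 128 0.0.0.0:51235 0.0.0.0:* users:(("rippled",pid=1234,fd=42))'

if echo "$SAMPLE" | grep -q ':51235 '; then
    PID=$(echo "$SAMPLE" | grep -o 'pid=[0-9]*' | cut -d= -f2)
    echo "Port 51235 already held by PID $PID - kill it before restarting"
fi

# Live usage: ss -tlnp | grep ':51235 '
```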

Problem: Not Validating (Stuck at "full")

Symptoms:

  • State stuck at "full"
  • No pubkey_validator in server_info
  • Validation not occurring

Diagnostic:

```
# Check for validator token
/opt/ripple/bin/rippled server_info | grep pubkey_validator

# Check configuration
sudo grep -A5 "validator_token" /opt/ripple/etc/rippled.cfg
```

Common Causes and Solutions:

  • No pubkey_validator field → add the [validator_token] section to the config
  • Errors mentioning the token in logs → regenerate the token, ensure a complete copy
  • pubkey_validator doesn't match expected → generate the token from the correct master key
  • Changes not taking effect → restart rippled after config changes
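The pubkey_validator check above can be scripted so monitoring catches a missing token automatically. In this sketch, SAMPLE is canned stand-in data; a live check would pipe `/opt/ripple/bin/rippled server_info` instead:

```shell
# Check whether the validator token is loaded by looking for
# pubkey_validator in server_info output. SAMPLE is canned data;
# a live run would pipe `rippled server_info` instead.
SAMPLE='{"info" : {"pubkey_validator" : "none", "server_state" : "full"}}'

PUBKEY=$(echo "$SAMPLE" | grep -o '"pubkey_validator" : "[^"]*"' | cut -d'"' -f4)

if [ -z "$PUBKEY" ] || [ "$PUBKEY" = "none" ]; then
    echo "Token NOT loaded - check the [validator_token] section in rippled.cfg"
else
    echo "Token loaded: $PUBKEY"
fi
```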

Problem: Synchronization Issues

Symptoms:

  • Ledger age continuously high
  • State fluctuating
  • "syncing" state for extended periods

Diagnostic:

```
# Check ledger age
/opt/ripple/bin/rippled server_info | grep -A5 "validated_ledger"

# Check peer quality
/opt/ripple/bin/rippled peers | grep -E "latency|status"

# Check for network issues
ping -c 10 r.ripple.com
```

Common Causes and Solutions:

  • High latency or packet loss → check the network path, consider different peers
  • Low peer count (<5) → check firewall, add fixed peers
  • High CPU, memory, or IO → check resources, consider a hardware upgrade
  • Time-related errors in logs → verify NTP synchronization
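Ledger age is the quickest sync-health signal and is easy to wire into an alert. A minimal sketch with illustrative thresholds (a healthy server usually keeps the age in single-digit seconds):

```shell
# Classify validated-ledger age in seconds; thresholds are illustrative.
check_ledger_age() {
    if [ "$1" -le 10 ]; then
        echo "OK: ledger age ${1}s"
    elif [ "$1" -le 60 ]; then
        echo "WARNING: ledger age ${1}s - watch for sync trouble"
    else
        echo "CRITICAL: ledger age ${1}s - server is falling behind"
    fi
}

check_ledger_age 4     # prints: OK: ledger age 4s
check_ledger_age 300   # prints: CRITICAL: ledger age 300s - server is falling behind
```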

Problem: Memory Exhaustion

Symptoms:

  • Server becomes slow
  • OOM killer invoked
  • Swap usage high

Diagnostic:

# Check memory usage
free -h

Check swap

swapon --show

Check rippled memory

ps aux | grep rippled | awk '{print $4, $6}'

Check for OOM events

dmesg | grep -i "out of memory"
```

Solutions:

1. Reduce node_size: edit rippled.cfg and set node_size = medium (instead of large). This trades performance for memory.

2. Add swap as a stopgap:
   sudo fallocate -l 4G /swapfile
   sudo chmod 600 /swapfile
   sudo mkswap /swapfile
   sudo swapon /swapfile
   Note: swap is not ideal for validators.

3. Add RAM: the long-term solution. 64 GB is recommended for validators.

4. Reduce stored history: keeps less history and reduces memory pressure.
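To catch memory pressure before the OOM killer does, track percent-used and alert on the trend. A minimal sketch computing it from MemTotal/MemAvailable figures (sample values in kB; `mem_pct_used` is a hypothetical helper name):

```shell
# Percent of memory in use from MemTotal/MemAvailable (kB),
# as reported in /proc/meminfo. Sample values are illustrative.
mem_pct_used() {
    echo $(( (($1 - $2) * 100) / $1 ))
}

# Live usage:
# total=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
# avail=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
# mem_pct_used "$total" "$avail"
echo "Memory used: $(mem_pct_used 65536000 9830400)%"   # prints: Memory used: 85%
```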

Problem: Disk Space Exhaustion

Symptoms:

  • Disk full errors
  • rippled failing to write
  • Performance degradation

Diagnostic:

```
# Check disk usage
df -h

# Find large files
sudo du -sh /var/lib/rippled/db/*
sudo du -sh /var/log/rippled/*

# Check for log growth
ls -lh /var/log/rippled/
```

Solutions:

1. Clean up logs:
   sudo /opt/ripple/bin/cleanup-logs.sh
   Verify logrotate is working.

2. Reduce stored history: lower online_delete in the config, then restart rippled.

3. Expand storage: resize the volume (cloud) or add a disk (bare metal).

4. Last resort - delete the database and resync:
   sudo systemctl stop rippled
   sudo rm -rf /var/lib/rippled/db/*
   sudo systemctl start rippled
   Note: requires a full resync.
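A small threshold check can turn the df output above into an alert before the disk actually fills. A sketch with an illustrative 85% threshold and a hypothetical `check_disk` helper:

```shell
# check_disk: flag a filesystem over an illustrative 85% threshold.
# arg1: usage like "92%", arg2: mountpoint
check_disk() {
    local usage=${1%\%}
    if [ "$usage" -ge 85 ]; then
        echo "ALERT: $2 at ${usage}% (>= 85%)"
    else
        echo "OK: $2 at ${usage}%"
    fi
}

# Live usage (GNU df):
# df --output=pcent,target | tail -n +2 | while read -r p t; do check_disk "$p" "$t"; done
check_disk "92%" /var/lib/rippled   # prints: ALERT: /var/lib/rippled at 92% (>= 85%)
```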


Severity Levels:

P1 (Critical):
- Validator completely down
- Security breach suspected
- Data integrity at risk
- Response: Immediate, all hands

P2 (High):
- Validator not validating
- Significant performance degradation
- Security concern (not an active breach)
- Response: Within 1 hour

P3 (Medium):
- Degraded performance
- Non-critical issues
- Warning-level alerts
- Response: Same business day

P4 (Low):
- Minor issues
- Cosmetic problems
- Enhancement requests
- Response: Next maintenance window
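If alerting scripts page you, encoding the severity-to-response mapping in one place keeps them consistent with the policy above. A minimal, illustrative sketch:

```shell
# Target response time per severity level, mirroring the policy above.
response_target() {
    case "$1" in
        P1) echo "Immediate, all hands" ;;
        P2) echo "Within 1 hour" ;;
        P3) echo "Same business day" ;;
        P4) echo "Next maintenance window" ;;
        *)  echo "Unknown severity: $1" ;;
    esac
}

response_target P2   # prints: Within 1 hour
```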
```
# Create incident response checklist
cat > ~/incident-response-checklist.md << 'EOF'
# Incident Response Checklist
- [ ] Complete incident report
- [ ] Schedule post-mortem (if P1/P2)
- [ ] Update documentation
- [ ] Implement preventive measures
EOF
```

Complete Validator Failure:

```
#!/bin/bash
# Emergency recovery procedure

echo "=== EMERGENCY RECOVERY ==="
echo "Time: $(date)"

# 1. Check if process is running
if pgrep -x rippled > /dev/null; then
    echo "rippled is running - checking state"
    /opt/ripple/bin/rippled server_info | grep server_state
else
    echo "rippled NOT running - attempting restart"
    sudo systemctl restart rippled
    sleep 30
fi

# 2. Check state after restart
STATE=$(/opt/ripple/bin/rippled server_info 2>/dev/null | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)

if [ "$STATE" = "proposing" ]; then
    echo "RECOVERED: Validator is proposing"
elif [ "$STATE" = "full" ] || [ "$STATE" = "syncing" ]; then
    echo "PARTIAL: Validator is $STATE - monitor recovery"
else
    echo "CRITICAL: Unexpected state '$STATE' - manual intervention needed"
fi
```

Security Incident:

```
#!/bin/bash
# Security incident response

echo "=== SECURITY INCIDENT RESPONSE ==="
echo "Time: $(date)"

# 1. Stop validator to prevent further damage
echo "Stopping rippled..."
sudo systemctl stop rippled

# 2. Preserve evidence
echo "Preserving logs..."
INCIDENT_DIR="/tmp/incident_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$INCIDENT_DIR"
cp -r /var/log/rippled "$INCIDENT_DIR/"
cp /var/log/auth.log "$INCIDENT_DIR/"
cp /opt/ripple/etc/rippled.cfg "$INCIDENT_DIR/"

# 3. Check for unauthorized access
echo "Checking for unauthorized access..."
last | head -20 > "$INCIDENT_DIR/last.txt"
who > "$INCIDENT_DIR/who.txt"
ps aux > "$INCIDENT_DIR/ps.txt"
netstat -plant > "$INCIDENT_DIR/netstat.txt"

# 4. Instructions
echo ""
echo "Evidence preserved to: $INCIDENT_DIR"
echo ""
echo "NEXT STEPS:"
echo "1. Analyze preserved evidence"
echo "2. Determine scope of compromise"
echo "3. If token compromised: Generate new token"
echo "4. If server compromised: Rebuild from scratch"
echo "5. Document incident thoroughly"
```

Log Analysis:

```
# Search for errors in a time range (last hour)
sudo grep -E "$(date -d '1 hour ago' '+%Y-%m-%d %H')" /var/log/rippled/debug.log | grep -i error

# Find all unique error types
sudo grep -i error /var/log/rippled/debug.log | awk '{print $NF}' | sort | uniq -c | sort -rn

# Track state changes (-E enables the | alternation)
sudo grep -iE "server_state|state change" /var/log/rippled/debug.log | tail -50

# Find connection issues
sudo grep -iE "disconnect|connection|timeout" /var/log/rippled/debug.log | tail -30

# Validation-specific messages
sudo grep -iE "validation|proposing|consensus" /var/log/rippled/debug.log | tail -30
```

Log Messages to Know:

Normal activity:

  • "Validation: ..." - Normal validation activity
  • "Peer connected" - New peer connections
  • "Ledger ..." - Ledger processing

Watch the frequency:

  • "Resource limit" - Approaching limits
  • "Slow query" - Performance concern
  • "Peer disconnected" - Normal but watch frequency

Investigate immediately:

  • "Database error" - Potential corruption
  • "Out of memory" - Resource exhaustion
  • "Failed to connect" - Network issues
  • "Invalid signature" - Possible attack or bug
```
# Create a timeline of today's notable events
sudo grep -E "$(date '+%Y-%m-%d')" /var/log/rippled/debug.log | \
    grep -iE "error|warning|state|validation|connect" | \
    head -100 > /tmp/timeline.txt
```

Look for patterns:

- What happened before the error?

- Did state change?

- Were there connection issues?

- What was the sequence of events?



```
# Post-Incident Report

- **Incident ID:** INC-YYYY-MM-DD-XXX
- **Severity:** P1/P2/P3/P4
- **Duration:** Start time to resolution time
- **Impact:** What was affected

## Timeline

| Time  | Event                 |
|-------|-----------------------|
| HH:MM | Issue detected        |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Fix applied           |
| HH:MM | Service restored      |
| HH:MM | Incident closed       |

## Root Cause

[Detailed explanation of what caused the incident]

## Resolution

[What was done to fix the issue]

## Impact

- Duration of outage: X minutes
- Validations missed: Approximately Y
- Reputation impact: [Assessment]

## Lessons Learned

1. [What we learned]
2. [What we learned]

## Preventive Measures

[What will prevent recurrence]
```

Root Cause Analysis Using Five Whys:

Problem: Validator was down for 2 hours

Why 1: Why was the validator down?
→ rippled crashed due to out of memory

Why 2: Why did it run out of memory?
→ Memory usage grew beyond available RAM

Why 3: Why did memory usage grow?
→ node_size was set to "huge" but server only has 64GB

Why 4: Why was node_size set incorrectly?
→ Configuration was copied from documentation example

Why 5: Why wasn't this caught earlier?
→ No monitoring alert for memory usage trend

Root Cause: Missing memory trend monitoring
Action: Add memory usage trend alerting at 70% threshold

Track Improvements Over Time:

Metrics to track:

- Mean Time To Detect (MTTD)
- Mean Time To Resolve (MTTR)
- Incident frequency by type
- Repeat incidents (same root cause)

Targets:

- MTTD < 5 minutes
- MTTR < 30 minutes for P2
- Zero repeat incidents for the same cause
- Decreasing incident frequency

Review questions:

- Are we improving?
- What patterns emerge?
- What investments would help?
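MTTD and MTTR fall out of the incident timeline directly. A small sketch assuming GNU date (`-d`); the timestamps are made up:

```shell
# Minutes between two timestamps, for MTTD/MTTR tracking.
# Assumes GNU date (-d). Timestamps are illustrative.
minutes_between() {
    local start_s end_s
    start_s=$(date -d "$1" +%s)
    end_s=$(date -d "$2" +%s)
    echo $(( (end_s - start_s) / 60 ))
}

DETECTED="2024-01-15 03:12"
RESOLVED="2024-01-15 03:41"
echo "MTTR: $(minutes_between "$DETECTED" "$RESOLVED") minutes"   # prints: MTTR: 29 minutes
```

Computing these from your incident log after each post-mortem makes the "are we improving?" question answerable with numbers.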

---
Escalate When:

- The issue is beyond your expertise
- Resolution is taking too long
- Impact is high or growing
- You need additional resources

Escalate to the community when:

- There is a potential network-wide issue
- You need input from rippled developers
- You discover a security vulnerability
- Unusual behavior may be affecting others

Notify proactively when:

- You are on UNLs and face an extended outage
- You discover a network-wide issue
- A security concern affects others

Document Your Escalation Path:

Level 1 - Self-service:
- Use troubleshooting guides
- Check documentation
- Review logs

Level 2 - Community:
- XRPL Discord #validators channel
- Community operator contacts
- Hosting provider support

Level 3 - Developers:
- rippled GitHub issues (for bugs)
- Direct developer contact (if established)
- Security disclosure channels (for vulnerabilities)

Keep the path current:
- Update contact information regularly
- Test communication channels
- Know response times

---

✅ Systematic troubleshooting is faster - Methodical approach beats random attempts

✅ Documentation accelerates resolution - Past incidents inform current troubleshooting

✅ Post-incident analysis prevents repeats - Root cause analysis leads to prevention

✅ Preparation reduces panic - Pre-written procedures execute better under pressure

⚠️ Every incident is unique - Past patterns help but don't guarantee solutions

⚠️ Root causes can be complex - Simple "five whys" may not capture full picture

⚠️ Prevention isn't always possible - Some incidents from external factors

📌 Untested procedures fail under pressure - Practice your incident response

📌 Incomplete documentation - Missing information extends resolution time

📌 No escalation path - Being stuck alone on a critical issue

📌 Ignoring post-incident analysis - Same problems recur without learning

You will have incidents. The measure of a professional operator isn't zero incidents—it's fast detection, calm response, effective resolution, and genuine learning.

Build your troubleshooting muscle in low-stakes situations (testnet, minor issues) so you're prepared for high-stakes moments. Document everything. Learn from every incident. Over time, you'll handle issues that would have panicked you before.


Assignment: Build and test your incident response capability.

Requirements:

1. Diagnostic toolkit:

  • Create a diagnostic script covering all common checks
  • Document common problems and solutions
  • Build a decision tree for your environment
  • Test diagnostic procedures

2. Emergency procedures:

  • Document emergency response procedures
  • Create quick-reference commands
  • Define your escalation path
  • Test emergency procedures on testnet

3. Post-incident process:

  • Create a post-incident report template
  • Document your root cause analysis method
  • Define an improvement tracking process
  • Create an action item tracking system

4. Simulation:

  • Conduct a simulated incident on testnet
  • Follow your procedures
  • Complete a post-incident report
  • Identify procedure improvements

Deliverables:

  • PDF or Markdown document
  • Scripts and procedures
  • Completed simulation report
  • Updated procedures based on simulation

Grading:

  • Comprehensive diagnostic toolkit (25%)
  • Tested emergency procedures (25%)
  • Complete post-incident process (25%)
  • Realistic simulation with learning (25%)

Time investment: 4-6 hours
Value: Tested incident response capability


1. Troubleshooting Approach (Tests Methodology):

What should be your FIRST step when investigating a validator issue?

A) Restart the service
B) Gather information about what's happening
C) Ask for help on Discord
D) Check recent configuration changes

Correct Answer: B
Explanation: Before taking any action, gather information: What are the symptoms? When did it start? What's the current state? This information guides effective troubleshooting. Restarting without understanding may hide the problem temporarily or make diagnosis harder.


2. Common Problem (Tests Technical Knowledge):

Your validator shows server_state "full" instead of "proposing" after restart. What's the most likely cause?

A) Network connectivity issues
B) Validator token not loaded from configuration
C) Insufficient memory
D) Clock synchronization problems

Correct Answer: B
Explanation: "full" state indicates the server is synchronized but not validating. This usually means the validator token isn't configured correctly—either missing from config, malformed, or not loaded. Check for pubkey_validator in server_info and verify [validator_token] section in configuration.


3. Incident Response (Tests Process Knowledge):

During a P1 incident, what should you do after stopping the immediate damage?

A) Immediately begin post-incident report
B) Preserve evidence before making further changes
C) Restart all services
D) Notify the community

Correct Answer: B
Explanation: After containing the incident, preserve evidence (logs, state, configuration) before making changes that might destroy diagnostic information. This enables proper root cause analysis later. Notification and documentation come after preservation.


4. Post-Incident Analysis (Tests Understanding):

What is the primary goal of post-incident analysis?

A) Assign blame for the incident
B) Document the timeline for compliance
C) Prevent recurrence by understanding root cause
D) Calculate the financial impact

Correct Answer: C
Explanation: Post-incident analysis focuses on prevention, not blame. By understanding the true root cause (not just the immediate trigger), we can implement changes that prevent the same or similar incidents. This improves overall reliability over time.


5. Escalation (Tests Judgment):

When should you escalate to the broader validator community?

A) For any issue you can't solve in 5 minutes
B) When you discover an issue that may affect other validators or the network
C) When your validator has been down for more than 1 hour
D) Only when explicitly required by Ripple

Correct Answer: B
Explanation: Escalate to the community when the issue may not be isolated to your validator—potential network-wide issues, security vulnerabilities, or unusual behavior others should know about. Your individual downtime isn't a community concern unless you're on UNLs, but network-affecting issues are.


Additional Resources:

  • Google SRE Book - Incident Management chapters
  • PagerDuty Incident Response documentation
  • Post-incident review best practices
  • Linux system administration guides
  • rippled GitHub issues (common problems)
  • XRPL Discord troubleshooting discussions
  • Five Whys methodology
  • Fishbone diagrams
  • Blameless post-mortems

For Next Lesson:
With troubleshooting capability established, Lesson 16 will cover domain verification and trust building—the steps toward being recognized as a legitimate, trustworthy validator operator.


End of Lesson 15


Key Takeaways

1. Follow systematic methodology: detect, assess, hypothesize, test, resolve, document. This structure prevents panic-driven random attempts.

2. Know your diagnostic commands: server_info, peers, logs, system resources. Quick information gathering enables quick resolution.

3. Classify incidents by severity: P1 requires immediate response, P4 can wait. Match the response to the impact.

4. Document incidents thoroughly: timeline, root cause, resolution, lessons learned. This knowledge prevents repeat incidents.

5. Conduct post-incident analysis: every incident is a learning opportunity. The goal is preventing recurrence, not assigning blame.