Troubleshooting and Incident Response
Learning Objectives
Apply systematic troubleshooting methodology to validator issues
Diagnose common validator problems using logs, metrics, and diagnostic commands
Execute incident response procedures under pressure
Conduct post-incident analysis to prevent recurrence
Document incidents for operational learning and improvement
Every validator operator will face incidents. The goal isn't perfection—it's resilience:
Incident Reality:
- Software bugs will affect your server
- Network issues will disrupt connectivity
- Hardware will eventually fail
- Human error will cause problems
- External factors will create challenges

What You Control:
- Detection time (minutes vs. hours)
- Response quality (systematic vs. panicked)
- Resolution speed (prepared vs. improvised)
- Learning captured (improved vs. repeated)
- Documentation (complete vs. missing)
This lesson prepares you for the inevitable.
---
Troubleshooting Framework:
1. DETECT
2. ASSESS
3. HYPOTHESIZE
4. TEST
5. RESOLVE
6. DOCUMENT
```bash
# Initial diagnostic commands
# Run these first when investigating issues

# 1. Is rippled running?
systemctl status rippled
pgrep -a rippled

# 2. What's the server state?
/opt/ripple/bin/rippled server_info 2>&1 | head -50

# 3. Recent log entries
sudo tail -100 /var/log/rippled/debug.log

# 4. System resources
free -h
df -h
top -bn1 | head -20

# 5. Network status
ss -tlnp | grep rippled
/opt/ripple/bin/rippled peers | head -30
```
Start Here: What's the symptom?

```
Symptom: Server not responding
├── Is rippled process running?
│ ├── No → Check why it stopped (logs, OOM, crash)
│ └── Yes → Check if RPC responding
│ ├── No → Check admin port binding
│ └── Yes → Proceed to state check
Symptom: Not in "proposing" state
├── What state is it in?
│ ├── "full" → Token not loaded or key issue
│ ├── "syncing" → Synchronization problem
│ ├── "tracking" → Partial sync, connectivity?
│ └── "connected" → Peer/network issues
Symptom: Low peer count
├── Is firewall correct?
│ ├── No → Fix firewall rules
│ └── Yes → Check external reachability
│ ├── Port blocked → Network/ISP issue
│ └── Port open → Check fixed peers
Symptom: High ledger age
├── Check peer count
│ ├── Low → Fix connectivity first
│ └── Normal → Check resource usage
│ ├── High CPU/Memory → Resource issue
│ └── Normal → Network latency?
```
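The state-based branch of this tree can be captured in a small triage helper. This is a sketch, not a diagnosis tool; the mappings simply mirror the tree above:

```bash
#!/usr/bin/env bash
# Sketch: map server_state to the most likely cause, mirroring the
# decision tree above. A triage hint for quick first assessment.

likely_cause() {
    case "$1" in
        proposing)  echo "healthy - validating normally" ;;
        full)       echo "token not loaded or key issue" ;;
        syncing)    echo "synchronization problem" ;;
        tracking)   echo "partial sync - check connectivity" ;;
        connected)  echo "peer/network issues" ;;
        *)          echo "unknown state - check logs" ;;
    esac
}

state=$(/opt/ripple/bin/rippled server_info 2>/dev/null \
    | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)
echo "State: ${state:-unavailable} -> $(likely_cause "$state")"
```

Dropping this into your diagnostic script gives a one-line summary before you dig into logs.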
Problem: Service won't start

Symptoms:
- Service fails to start
- Process exits immediately
- Errors in journal

Diagnostic:

```bash
# Check service status
sudo systemctl status rippled

# Check journal for errors
sudo journalctl -u rippled --no-pager | tail -50

# Try starting manually for more output
sudo /opt/ripple/bin/rippled --conf /opt/ripple/etc/rippled.cfg --fg
```
Common Causes and Solutions:

| Symptom | Solution |
|---|---|
| "Error in configuration file" | Check rippled.cfg syntax; validate each [section] stanza |
| "Address already in use" | Check for a zombie process; kill it if needed |
| "Database error" or similar | May need to delete the database and resync |
| OOM killer messages in dmesg | Reduce node_size or add memory |
| "Permission denied" errors | Check file ownership |
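For the "Address already in use" case, here is a sketch of finding whatever is holding the peer port before restarting. The port number is an assumption (match it to the [port_peer] stanza in your rippled.cfg), and `extract_pid` is a small helper for parsing `ss` output:

```bash
#!/usr/bin/env bash
# Sketch: identify the process holding the rippled peer port.
# PORT is an assumption -- set it to match your rippled.cfg.
PORT="${PORT:-51235}"

# Helper: pull the pid out of an `ss -tlnp` process column such as
# users:(("rippled",pid=1234,fd=20))
extract_pid() {
    echo "$1" | sed -n 's/.*pid=\([0-9]\+\).*/\1/p'
}

holder=$(ss -tlnp 2>/dev/null | grep ":$PORT " || true)
if [ -n "$holder" ]; then
    pid=$(extract_pid "$holder")
    echo "Port $PORT held by PID $pid:"
    ps -p "$pid" -o pid,comm,etime 2>/dev/null
    # Only kill after confirming it's a stale rippled, not a healthy one:
    # sudo kill "$pid"
else
    echo "Port $PORT is free"
fi
```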
Problem: Not validating (stuck at "full")

Symptoms:
- State stuck at "full"
- No pubkey_validator in server_info
- Validation not occurring

Diagnostic:

```bash
# Check for validator token
/opt/ripple/bin/rippled server_info | grep pubkey_validator

# Check configuration
sudo grep -A5 "validator_token" /opt/ripple/etc/rippled.cfg
```
Common Causes and Solutions:

| Symptom | Solution |
|---|---|
| No pubkey_validator field | Add [validator_token] section to config |
| Errors mentioning token in logs | Regenerate token; ensure complete copy |
| pubkey_validator doesn't match expected | Generate token from correct master key |
| Changes not taking effect | Restart rippled after config changes |
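A quick check for the "doesn't match expected" case can be scripted. This is a sketch; `EXPECTED_KEY` is a placeholder you would replace with the public key from your own token generation output:

```bash
#!/usr/bin/env bash
# Sketch: confirm the running validator key matches the one you expect.
# EXPECTED_KEY is a placeholder -- use the public key recorded when you
# generated your validator token.
EXPECTED_KEY="${EXPECTED_KEY:-your-expected-key-here}"

keys_match() {
    # True only when both keys are identical and non-empty
    [ -n "$1" ] && [ "$1" = "$2" ]
}

running=$(/opt/ripple/bin/rippled server_info 2>/dev/null \
    | grep -o '"pubkey_validator" : "[^"]*"' | cut -d'"' -f4)

if [ -z "$running" ]; then
    echo "No pubkey_validator in server_info -- token not loaded"
elif keys_match "$EXPECTED_KEY" "$running"; then
    echo "OK: validator key matches expected"
else
    echo "MISMATCH: running=$running expected=$EXPECTED_KEY"
fi
```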
Problem: Synchronization issues

Symptoms:
- Ledger age continuously high
- State fluctuating
- "syncing" state for extended periods

Diagnostic:

```bash
# Check ledger age
/opt/ripple/bin/rippled server_info | grep -A5 "validated_ledger"

# Check peer quality
/opt/ripple/bin/rippled peers | grep -E "latency|status"

# Check for network issues
ping -c 10 r.ripple.com
```
Common Causes and Solutions:

| Symptom | Solution |
|---|---|
| High latency or packet loss | Check network path; consider different peers |
| Low peer count (<5) | Check firewall; add fixed peers |
| High CPU, memory, or IO | Check resources; consider hardware upgrade |
| Time-related errors in logs | Verify NTP synchronization |
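Verifying NTP synchronization can be scripted on systemd hosts. A minimal sketch, assuming `timedatectl` is available and reports the usual "System clock synchronized" line:

```bash
#!/usr/bin/env bash
# Sketch: verify the system clock is NTP-synchronized (consensus is
# time-sensitive). Assumes a systemd host with timedatectl.

ntp_ok() {
    # Interpret timedatectl's yes/no synchronization flag
    [ "$1" = "yes" ]
}

sync_flag=$(timedatectl 2>/dev/null \
    | awk -F': ' '/System clock synchronized/ {print $2}')

if ntp_ok "$sync_flag"; then
    echo "Clock is NTP-synchronized"
else
    echo "WARNING: clock not synchronized (flag: '${sync_flag:-unknown}')"
    echo "Check your time service, e.g. chronyd or systemd-timesyncd"
fi
```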
Problem: Memory exhaustion

Symptoms:
- Server becomes slow
- OOM killer invoked
- Swap usage high

Diagnostic:

```bash
# Check memory usage
free -h

# Check swap
swapon --show

# Check rippled memory (%MEM and RSS)
ps aux | grep [r]ippled | awk '{print $4, $6}'

# Check for OOM events
dmesg | grep -i "out of memory"
```
Solutions:

1. Reduce node_size: edit rippled.cfg to `node_size = medium` (instead of large). Trades performance for memory.

2. Add swap as a stopgap (note: swap is not ideal for validators):

```bash
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```

3. Add RAM: the long-term solution; 64 GB recommended for validators.

4. Reduce online_delete: keeps less history and reduces memory pressure.
Problem: Disk space exhaustion

Symptoms:
- Disk full errors
- rippled failing to write
- Performance degradation

Diagnostic:

```bash
# Check disk usage
df -h

# Find large files
sudo du -sh /var/lib/rippled/db/*
sudo du -sh /var/log/rippled/*

# Check for log growth
ls -lh /var/log/rippled/
```
Solutions:

1. Clean up logs and verify logrotate is working:

```bash
sudo /opt/ripple/bin/cleanup-logs.sh
```

2. Lower online_delete in config, then restart rippled.

3. Expand storage (cloud: resize the volume; bare metal: add a disk).

4. Last resort: delete the database and resync (note: requires full resync):

```bash
sudo systemctl stop rippled
sudo rm -rf /var/lib/rippled/db/*
sudo systemctl start rippled
```
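Catching disk pressure before it becomes an incident is cheaper than any of the solutions above. A sketch of a cron-able check; the 80% threshold and the mount point are assumptions to tune for your deployment:

```bash
#!/usr/bin/env bash
# Sketch: warn before the database volume fills up. THRESHOLD and MOUNT
# are assumptions -- adjust both to your deployment.
MOUNT="${MOUNT:-/var/lib/rippled}"
THRESHOLD="${THRESHOLD:-80}"

disk_over_threshold() {
    # $1 = used percent (integer), $2 = threshold percent
    [ "$1" -ge "$2" ]
}

used=$(df --output=pcent "$MOUNT" 2>/dev/null | tail -1 | tr -dc '0-9')

if [ -z "$used" ]; then
    echo "Could not read disk usage for $MOUNT"
elif disk_over_threshold "$used" "$THRESHOLD"; then
    echo "WARNING: $MOUNT at ${used}% (threshold ${THRESHOLD}%)"
else
    echo "OK: $MOUNT at ${used}%"
fi
```

Run from cron every few minutes and route the WARNING line to your alerting channel.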
Severity Levels:

P1 (Critical):
- Validator completely down
- Security breach suspected
- Data integrity at risk
- Response: Immediate, all hands

P2 (High):
- Validator not validating
- Significant performance degradation
- Security concern (not active breach)
- Response: Within 1 hour

P3 (Medium):
- Degraded performance
- Non-critical issues
- Warning-level alerts
- Response: Same business day

P4 (Low):
- Minor issues
- Cosmetic problems
- Enhancement requests
- Response: Next maintenance window
```bash
# Create incident response checklist
cat > ~/incident-response-checklist.md << 'EOF'
# Incident Response Checklist
- [ ] Complete incident report
- [ ] Schedule post-mortem (if P1/P2)
- [ ] Update documentation
- [ ] Implement preventive measures
EOF
```
Complete Validator Failure:

```bash
#!/bin/bash
# Emergency recovery procedure

echo "=== EMERGENCY RECOVERY ==="
echo "Time: $(date)"

# 1. Check if process is running
if pgrep -x rippled > /dev/null; then
    echo "rippled is running - checking state"
    /opt/ripple/bin/rippled server_info | grep server_state
else
    echo "rippled NOT running - attempting restart"
    sudo systemctl restart rippled
    sleep 30
fi

# 2. Check state after restart
STATE=$(/opt/ripple/bin/rippled server_info 2>/dev/null | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)

if [ "$STATE" = "proposing" ]; then
    echo "RECOVERED: Validator is proposing"
elif [ "$STATE" = "full" ] || [ "$STATE" = "syncing" ]; then
    echo "PARTIAL: Validator is $STATE - monitor recovery"
else
    echo "CRITICAL: Unexpected state '$STATE' - manual intervention needed"
fi
```
Security Incident:

```bash
#!/bin/bash
# Security incident response

echo "=== SECURITY INCIDENT RESPONSE ==="
echo "Time: $(date)"

# 1. Stop validator to prevent further damage
echo "Stopping rippled..."
sudo systemctl stop rippled

# 2. Preserve evidence
echo "Preserving logs..."
INCIDENT_DIR="/tmp/incident_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$INCIDENT_DIR"
cp -r /var/log/rippled "$INCIDENT_DIR/"
cp /var/log/auth.log "$INCIDENT_DIR/"
cp /opt/ripple/etc/rippled.cfg "$INCIDENT_DIR/"

# 3. Check for unauthorized access
echo "Checking for unauthorized access..."
last | head -20 > "$INCIDENT_DIR/last.txt"
who > "$INCIDENT_DIR/who.txt"
ps aux > "$INCIDENT_DIR/ps.txt"
netstat -plant > "$INCIDENT_DIR/netstat.txt"

# 4. Instructions
echo ""
echo "Evidence preserved to: $INCIDENT_DIR"
echo ""
echo "NEXT STEPS:"
echo "1. Analyze preserved evidence"
echo "2. Determine scope of compromise"
echo "3. If token compromised: Generate new token"
echo "4. If server compromised: Rebuild from scratch"
echo "5. Document incident thoroughly"
```
```bash
# Search for errors in time range
sudo grep -E "$(date -d '1 hour ago' '+%Y-%m-%d %H')" /var/log/rippled/debug.log | grep -i error

# Find all unique error types
sudo grep -i error /var/log/rippled/debug.log | awk '{print $NF}' | sort | uniq -c | sort -rn

# Track state changes
sudo grep -iE "server_state|state change" /var/log/rippled/debug.log | tail -50

# Find connection issues
sudo grep -iE "disconnect|connection|timeout" /var/log/rippled/debug.log | tail -30

# Validation-specific messages
sudo grep -iE "validation|proposing|consensus" /var/log/rippled/debug.log | tail -30
```
Normal activity:
- "Validation: ..." - Normal validation activity
- "Peer connected" - New peer connections
- "Ledger ..." - Ledger processing

Watch closely:
- "Resource limit" - Approaching limits
- "Slow query" - Performance concern
- "Peer disconnected" - Normal but watch frequency

Serious warnings:
- "Database error" - Potential corruption
- "Out of memory" - Resource exhaustion
- "Failed to connect" - Network issues
- "Invalid signature" - Possible attack or bug
```bash
# Create timeline of events
sudo grep -E "$(date '+%Y-%m-%d')" /var/log/rippled/debug.log | \
    grep -iE "error|warning|state|validation|connect" | \
    head -100 > /tmp/timeline.txt
```

Look for patterns:
- What happened before the error?
- Did state change?
- Were there connection issues?
- What was the sequence of events?
```markdown
# Post-Incident Report

## Summary
- **Incident ID:** INC-YYYY-MM-DD-XXX
- **Severity:** P1/P2/P3/P4
- **Duration:** Start time to resolution time
- **Impact:** What was affected

## Timeline
| Time | Event |
|---|---|
| HH:MM | Issue detected |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Fix applied |
| HH:MM | Service restored |
| HH:MM | Incident closed |

## Root Cause
[Detailed explanation of what caused the incident]

## Resolution
[What was done to fix the issue]

## Impact Assessment
- Duration of outage: X minutes
- Validations missed: Approximately Y
- Reputation impact: [Assessment]

## Lessons Learned
- [What we learned]
- [What we learned]

## Prevention
[What will prevent recurrence]
```
Root Cause Analysis Using Five Whys:
Problem: Validator was down for 2 hours
Why 1: Why was the validator down?
→ rippled crashed due to out of memory
Why 2: Why did it run out of memory?
→ Memory usage grew beyond available RAM
Why 3: Why did memory usage grow?
→ node_size was set to "huge" but server only has 64GB
Why 4: Why was node_size set incorrectly?
→ Configuration was copied from documentation example
Why 5: Why wasn't this caught earlier?
→ No monitoring alert for memory usage trend
Root Cause: Missing memory trend monitoring
Action: Add memory usage trend alerting at 70% threshold
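That action item can be sketched as a cron-able check. The 70% threshold comes from the example above; how you deliver the alert (mail, webhook) is left open, so this version just prints:

```bash
#!/usr/bin/env bash
# Sketch: alert when memory usage crosses a threshold (70% per the
# action item above). A cron job would route the ALERT line to the
# operator; here it simply prints.
THRESHOLD="${THRESHOLD:-70}"

mem_used_pct() {
    # $1 = used kB, $2 = total kB -> integer percent
    echo $(( $1 * 100 / $2 ))
}

read -r total used <<< "$(free 2>/dev/null | awk '/^Mem:/ {print $2, $3}')" || true

if [ -n "$total" ] && [ "$total" -gt 0 ]; then
    pct=$(mem_used_pct "$used" "$total")
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "ALERT: memory at ${pct}% (threshold ${THRESHOLD}%)"
    else
        echo "OK: memory at ${pct}%"
    fi
fi
```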
Track Improvements Over Time:
- Mean Time To Detect (MTTD)
- Mean Time To Resolve (MTTR)
- Incident frequency by type
- Repeat incidents (same root cause)

Targets:
- MTTD < 5 minutes
- MTTR < 30 minutes for P2
- Zero repeat incidents for same cause
- Decreasing incident frequency

Review regularly:
- Are we improving?
- What patterns emerge?
- What investments would help?
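MTTR is simple to compute from the timestamps you already record in incident reports. A sketch using GNU `date`; the comma-separated detected/resolved log format is an assumption, not a standard:

```bash
#!/usr/bin/env bash
# Sketch: compute resolution minutes from two timestamps, e.g. the
# "Issue detected" and "Service restored" rows of an incident report.
# Assumes GNU date (-d) and "YYYY-MM-DD HH:MM" timestamps.

minutes_between() {
    local start_s end_s
    start_s=$(date -d "$1" +%s)
    end_s=$(date -d "$2" +%s)
    echo $(( (end_s - start_s) / 60 ))
}

# Example: averaging MTTR over a small hypothetical incident log
total=0; count=0
while IFS=, read -r detected resolved; do
    total=$(( total + $(minutes_between "$detected" "$resolved") ))
    count=$(( count + 1 ))
done <<'EOF'
2024-01-10 09:00,2024-01-10 09:42
2024-01-22 14:05,2024-01-22 14:20
EOF
echo "MTTR over $count incidents: $(( total / count )) minutes"
```

For the two sample incidents above (42 and 15 minutes), this prints an MTTR of 28 minutes.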
---
Escalate When:
- Issue beyond your expertise
- Resolution taking too long
- Impact is high or growing
- Need additional resources

Escalate to the community or developers when:
- Potential network-wide issue
- Need input from rippled developers
- Security vulnerability discovered
- Unusual behavior affecting others

Notify the broader community:
- If on UNLs and extended outage
- If you discover network-wide issue
- If security concern affects others

Document Your Escalation Path:

Level 1 (self-service):
- Use troubleshooting guides
- Check documentation
- Review logs

Level 2 (community and providers):
- XRPL Discord #validators channel
- Community operator contacts
- Hosting provider support

Level 3 (developers and security):
- rippled GitHub issues (for bugs)
- Direct developer contact (if established)
- Security disclosure channels (for vulnerabilities)

Keep the path current:
- Update contact information regularly
- Test communication channels
- Know response times
---
✅ Systematic troubleshooting is faster - Methodical approach beats random attempts
✅ Documentation accelerates resolution - Past incidents inform current troubleshooting
✅ Post-incident analysis prevents repeats - Root cause analysis leads to prevention
✅ Preparation reduces panic - Pre-written procedures execute better under pressure
⚠️ Every incident is unique - Past patterns help but don't guarantee solutions
⚠️ Root causes can be complex - Simple "five whys" may not capture full picture
⚠️ Prevention isn't always possible - Some incidents stem from external factors
📌 Untested procedures fail under pressure - Practice your incident response
📌 Incomplete documentation - Missing information extends resolution time
📌 No escalation path - Being stuck alone on a critical issue
📌 Ignoring post-incident analysis - Same problems recur without learning
You will have incidents. The measure of a professional operator isn't zero incidents—it's fast detection, calm response, effective resolution, and genuine learning.
Build your troubleshooting muscle in low-stakes situations (testnet, minor issues) so you're prepared for high-stakes moments. Document everything. Learn from every incident. Over time, you'll handle issues that would have panicked you before.
Assignment: Build and test your incident response capability.

Requirements:

1. Diagnostic toolkit:
- Create diagnostic script covering all common checks
- Document common problems and solutions
- Build decision tree for your environment
- Test diagnostic procedures

2. Emergency procedures:
- Document emergency response procedures
- Create quick-reference commands
- Define escalation path
- Test emergency procedures on testnet

3. Post-incident process:
- Create post-incident report template
- Document root cause analysis method
- Define improvement tracking process
- Create action item tracking system

4. Simulation:
- Conduct simulated incident on testnet
- Follow your procedures
- Complete post-incident report
- Identify procedure improvements

Deliverables:
- PDF or Markdown document
- Scripts and procedures
- Completed simulation report
- Updated procedures based on simulation

Grading:
- Comprehensive diagnostic toolkit (25%)
- Tested emergency procedures (25%)
- Complete post-incident process (25%)
- Realistic simulation with learning (25%)

Time investment: 4-6 hours
Value: Tested incident response capability
1. Troubleshooting Approach (Tests Methodology):
What should be your FIRST step when investigating a validator issue?
A) Restart the service
B) Gather information about what's happening
C) Ask for help on Discord
D) Check recent configuration changes
Correct Answer: B
Explanation: Before taking any action, gather information: What are the symptoms? When did it start? What's the current state? This information guides effective troubleshooting. Restarting without understanding may hide the problem temporarily or make diagnosis harder.
2. Common Problem (Tests Technical Knowledge):
Your validator shows server_state "full" instead of "proposing" after restart. What's the most likely cause?
A) Network connectivity issues
B) Validator token not loaded from configuration
C) Insufficient memory
D) Clock synchronization problems
Correct Answer: B
Explanation: "full" state indicates the server is synchronized but not validating. This usually means the validator token isn't configured correctly—either missing from config, malformed, or not loaded. Check for pubkey_validator in server_info and verify [validator_token] section in configuration.
3. Incident Response (Tests Process Knowledge):
During a P1 incident, what should you do after stopping the immediate damage?
A) Immediately begin post-incident report
B) Preserve evidence before making further changes
C) Restart all services
D) Notify the community
Correct Answer: B
Explanation: After containing the incident, preserve evidence (logs, state, configuration) before making changes that might destroy diagnostic information. This enables proper root cause analysis later. Notification and documentation come after preservation.
4. Post-Incident Analysis (Tests Understanding):
What is the primary goal of post-incident analysis?
A) Assign blame for the incident
B) Document the timeline for compliance
C) Prevent recurrence by understanding root cause
D) Calculate the financial impact
Correct Answer: C
Explanation: Post-incident analysis focuses on prevention, not blame. By understanding the true root cause (not just the immediate trigger), we can implement changes that prevent the same or similar incidents. This improves overall reliability over time.
5. Escalation (Tests Judgment):
When should you escalate to the broader validator community?
A) For any issue you can't solve in 5 minutes
B) When you discover an issue that may affect other validators or the network
C) When your validator has been down for more than 1 hour
D) Only when explicitly required by Ripple
Correct Answer: B
Explanation: Escalate to the community when the issue may not be isolated to your validator—potential network-wide issues, security vulnerabilities, or unusual behavior others should know about. Your individual downtime isn't a community concern unless you're on UNLs, but network-affecting issues are.
- Google SRE Book - Incident Management chapters
- PagerDuty Incident Response documentation
- Post-incident review best practices
- Linux system administration guides
- rippled GitHub issues (common problems)
- XRPL Discord troubleshooting discussions
- Five Whys methodology
- Fishbone diagrams
- Blameless post-mortems
For Next Lesson:
With troubleshooting capability established, Lesson 16 will cover domain verification and trust building—the steps toward being recognized as a legitimate, trustworthy validator operator.
End of Lesson 15
Total words: ~5,500
Estimated completion time: 60 minutes reading + 4-6 hours implementation and simulation
Key Takeaways

**Follow systematic methodology**—detect, assess, hypothesize, test, resolve, document; this structure prevents panic-driven random attempts.

**Know your diagnostic commands**—server_info, peers, logs, system resources; quick information gathering enables quick resolution.

**Classify incidents by severity**—P1 requires immediate response, P4 can wait; appropriate response based on impact.

**Document incidents thoroughly**—timeline, root cause, resolution, lessons learned; this knowledge prevents repeat incidents.

**Conduct post-incident analysis**—every incident is a learning opportunity; the goal is preventing recurrence, not assigning blame.

---