Troubleshooting and Incident Response
Learning Objectives
Apply systematic troubleshooting methodology to validator issues
Diagnose common validator problems using logs, metrics, and diagnostic commands
Execute incident response procedures under pressure
Conduct post-incident analysis to prevent recurrence
Document incidents for operational learning and improvement
Every validator operator will face incidents. The goal isn't perfection—it's resilience:
Incident Reality:
- Software bugs will affect your server
- Network issues will disrupt connectivity
- Hardware will eventually fail
- Human error will cause problems
- External factors will create challenges

What You Control:
- Detection time (minutes vs. hours)
- Response quality (systematic vs. panicked)
- Resolution speed (prepared vs. improvised)
- Learning captured (improved vs. repeated)
- Documentation (complete vs. missing)
This lesson prepares you for the inevitable.
---
Troubleshooting Framework:
1. DETECT
2. ASSESS
3. HYPOTHESIZE
4. TEST
5. RESOLVE
6. DOCUMENT
```bash
# Initial diagnostic commands
# Run these first when investigating issues

# 1. Is rippled running?
systemctl status rippled
pgrep -a rippled

# 2. What's the server state?
/opt/ripple/bin/rippled server_info 2>&1 | head -50

# 3. Recent log entries
sudo tail -100 /var/log/rippled/debug.log

# 4. System resources
free -h
df -h
top -bn1 | head -20

# 5. Network status
ss -tlnp | grep rippled
/opt/ripple/bin/rippled peers | head -30
```
Start Here: What's the symptom?

```
Symptom: Server not responding
├── Is rippled process running?
│ ├── No → Check why it stopped (logs, OOM, crash)
│ └── Yes → Check if RPC responding
│ ├── No → Check admin port binding
│ └── Yes → Proceed to state check
Symptom: Not in "proposing" state
├── What state is it in?
│ ├── "full" → Token not loaded or key issue
│ ├── "syncing" → Synchronization problem
│ ├── "tracking" → Partial sync, connectivity?
│ └── "connected" → Peer/network issues
Symptom: Low peer count
├── Is firewall correct?
│ ├── No → Fix firewall rules
│ └── Yes → Check external reachability
│ ├── Port blocked → Network/ISP issue
│ └── Port open → Check fixed peers
Symptom: High ledger age
├── Check peer count
│ ├── Low → Fix connectivity first
│ └── Normal → Check resource usage
│ ├── High CPU/Memory → Resource issue
│ └── Normal → Network latency?
```
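The state-based branch of this tree can be captured in a small triage helper. This is a sketch, not a diagnosis tool; the mappings simply mirror the tree above:

```bash
#!/usr/bin/env bash
# Sketch: map server_state to the most likely cause, mirroring the
# decision tree above. A triage hint for quick first assessment.

likely_cause() {
    case "$1" in
        proposing)  echo "healthy - validating normally" ;;
        full)       echo "token not loaded or key issue" ;;
        syncing)    echo "synchronization problem" ;;
        tracking)   echo "partial sync - check connectivity" ;;
        connected)  echo "peer/network issues" ;;
        *)          echo "unknown state - check logs" ;;
    esac
}

state=$(/opt/ripple/bin/rippled server_info 2>/dev/null \
    | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)
echo "State: ${state:-unavailable} -> $(likely_cause "$state")"
```

Dropping this into your diagnostic script gives a one-line summary before you dig into logs.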
Problem: Service won't start

Symptoms:
- Service fails to start
- Process exits immediately
- Errors in journal

Diagnostic:

```bash
# Check service status
sudo systemctl status rippled

# Check journal for errors
sudo journalctl -u rippled --no-pager | tail -50

# Try starting manually for more output
sudo /opt/ripple/bin/rippled --conf /opt/ripple/etc/rippled.cfg --fg
```
Common Causes and Solutions:

| Symptom | Solution |
|---|---|
| "Error in configuration file" | Check rippled.cfg syntax; validate each [section] stanza |
| "Address already in use" | Check for a zombie process; kill it if needed |
| "Database error" or similar | May need to delete the database and resync |
| OOM killer messages in dmesg | Reduce node_size or add memory |
| "Permission denied" errors | Check file ownership |
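For the "Address already in use" case, here is a sketch of finding whatever is holding the peer port before restarting. The port number is an assumption (match it to the [port_peer] stanza in your rippled.cfg), and `extract_pid` is a small helper for parsing `ss` output:

```bash
#!/usr/bin/env bash
# Sketch: identify the process holding the rippled peer port.
# PORT is an assumption -- set it to match your rippled.cfg.
PORT="${PORT:-51235}"

# Helper: pull the pid out of an `ss -tlnp` process column such as
# users:(("rippled",pid=1234,fd=20))
extract_pid() {
    echo "$1" | sed -n 's/.*pid=\([0-9]\+\).*/\1/p'
}

holder=$(ss -tlnp 2>/dev/null | grep ":$PORT " || true)
if [ -n "$holder" ]; then
    pid=$(extract_pid "$holder")
    echo "Port $PORT held by PID $pid:"
    ps -p "$pid" -o pid,comm,etime 2>/dev/null
    # Only kill after confirming it's a stale rippled, not a healthy one:
    # sudo kill "$pid"
else
    echo "Port $PORT is free"
fi
```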
Problem: Not validating (stuck at "full")

Symptoms:
- State stuck at "full"
- No pubkey_validator in server_info
- Validation not occurring

Diagnostic:

```bash
# Check for validator token
/opt/ripple/bin/rippled server_info | grep pubkey_validator

# Check configuration
sudo grep -A5 "validator_token" /opt/ripple/etc/rippled.cfg
```
Common Causes and Solutions:

| Symptom | Solution |
|---|---|
| No pubkey_validator field | Add [validator_token] section to config |
| Errors mentioning token in logs | Regenerate token; ensure complete copy |
| pubkey_validator doesn't match expected | Generate token from correct master key |
| Changes not taking effect | Restart rippled after config changes |
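A quick check for the "doesn't match expected" case can be scripted. This is a sketch; `EXPECTED_KEY` is a placeholder you would replace with the public key from your own token generation output:

```bash
#!/usr/bin/env bash
# Sketch: confirm the running validator key matches the one you expect.
# EXPECTED_KEY is a placeholder -- use the public key recorded when you
# generated your validator token.
EXPECTED_KEY="${EXPECTED_KEY:-your-expected-key-here}"

keys_match() {
    # True only when both keys are identical and non-empty
    [ -n "$1" ] && [ "$1" = "$2" ]
}

running=$(/opt/ripple/bin/rippled server_info 2>/dev/null \
    | grep -o '"pubkey_validator" : "[^"]*"' | cut -d'"' -f4)

if [ -z "$running" ]; then
    echo "No pubkey_validator in server_info -- token not loaded"
elif keys_match "$EXPECTED_KEY" "$running"; then
    echo "OK: validator key matches expected"
else
    echo "MISMATCH: running=$running expected=$EXPECTED_KEY"
fi
```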
Problem: Synchronization issues

Symptoms:
- Ledger age continuously high
- State fluctuating
- "syncing" state for extended periods

Diagnostic:

```bash
# Check ledger age
/opt/ripple/bin/rippled server_info | grep -A5 "validated_ledger"

# Check peer quality
/opt/ripple/bin/rippled peers | grep -E "latency|status"

# Check for network issues
ping -c 10 r.ripple.com
```
Common Causes and Solutions:

| Symptom | Solution |
|---|---|
| High latency or packet loss | Check network path; consider different peers |
| Low peer count (<5) | Check firewall; add fixed peers |
| High CPU, memory, or IO | Check resources; consider hardware upgrade |
| Time-related errors in logs | Verify NTP synchronization |
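Verifying NTP synchronization can be scripted on systemd hosts. A minimal sketch, assuming `timedatectl` is available and reports the usual "System clock synchronized" line:

```bash
#!/usr/bin/env bash
# Sketch: verify the system clock is NTP-synchronized (consensus is
# time-sensitive). Assumes a systemd host with timedatectl.

ntp_ok() {
    # Interpret timedatectl's yes/no synchronization flag
    [ "$1" = "yes" ]
}

sync_flag=$(timedatectl 2>/dev/null \
    | awk -F': ' '/System clock synchronized/ {print $2}')

if ntp_ok "$sync_flag"; then
    echo "Clock is NTP-synchronized"
else
    echo "WARNING: clock not synchronized (flag: '${sync_flag:-unknown}')"
    echo "Check your time service, e.g. chronyd or systemd-timesyncd"
fi
```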
Problem: Memory exhaustion

Symptoms:
- Server becomes slow
- OOM killer invoked
- Swap usage high

Diagnostic:

```bash
# Check memory usage
free -h

# Check swap
swapon --show

# Check rippled memory (%MEM and RSS)
ps aux | grep [r]ippled | awk '{print $4, $6}'

# Check for OOM events
dmesg | grep -i "out of memory"
```
Solutions:

1. Reduce node_size: edit rippled.cfg to `node_size = medium` (instead of large). Trades performance for memory.

2. Add swap as a stopgap (note: swap is not ideal for validators):

```bash
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```

3. Add RAM: the long-term solution; 64 GB recommended for validators.

4. Reduce online_delete: keeps less history and reduces memory pressure.
Problem: Disk space exhaustion

Symptoms:
- Disk full errors
- rippled failing to write
- Performance degradation

Diagnostic:

```bash
# Check disk usage
df -h

# Find large files
sudo du -sh /var/lib/rippled/db/*
sudo du -sh /var/log/rippled/*

# Check for log growth
ls -lh /var/log/rippled/
```
Solutions:

1. Clean up logs and verify logrotate is working:

```bash
sudo /opt/ripple/bin/cleanup-logs.sh
```

2. Lower online_delete in config, then restart rippled.

3. Expand storage (cloud: resize the volume; bare metal: add a disk).

4. Last resort: delete the database and resync (note: requires full resync):

```bash
sudo systemctl stop rippled
sudo rm -rf /var/lib/rippled/db/*
sudo systemctl start rippled
```
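Catching disk pressure before it becomes an incident is cheaper than any of the solutions above. A sketch of a cron-able check; the 80% threshold and the mount point are assumptions to tune for your deployment:

```bash
#!/usr/bin/env bash
# Sketch: warn before the database volume fills up. THRESHOLD and MOUNT
# are assumptions -- adjust both to your deployment.
MOUNT="${MOUNT:-/var/lib/rippled}"
THRESHOLD="${THRESHOLD:-80}"

disk_over_threshold() {
    # $1 = used percent (integer), $2 = threshold percent
    [ "$1" -ge "$2" ]
}

used=$(df --output=pcent "$MOUNT" 2>/dev/null | tail -1 | tr -dc '0-9')

if [ -z "$used" ]; then
    echo "Could not read disk usage for $MOUNT"
elif disk_over_threshold "$used" "$THRESHOLD"; then
    echo "WARNING: $MOUNT at ${used}% (threshold ${THRESHOLD}%)"
else
    echo "OK: $MOUNT at ${used}%"
fi
```

Run from cron every few minutes and route the WARNING line to your alerting channel.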
Severity Levels:

P1 (Critical):
- Validator completely down
- Security breach suspected
- Data integrity at risk
- Response: Immediate, all hands

P2 (High):
- Validator not validating
- Significant performance degradation
- Security concern (not active breach)
- Response: Within 1 hour

P3 (Medium):
- Degraded performance
- Non-critical issues
- Warning-level alerts
- Response: Same business day

P4 (Low):
- Minor issues
- Cosmetic problems
- Enhancement requests
- Response: Next maintenance window
```bash
# Create incident response checklist
cat > ~/incident-response-checklist.md << 'EOF'
# Incident Response Checklist
- [ ] Complete incident report
- [ ] Schedule post-mortem (if P1/P2)
- [ ] Update documentation
- [ ] Implement preventive measures
EOF
```
Complete Validator Failure:

```bash
#!/bin/bash
# Emergency recovery procedure

echo "=== EMERGENCY RECOVERY ==="
echo "Time: $(date)"

# 1. Check if process is running
if pgrep -x rippled > /dev/null; then
    echo "rippled is running - checking state"
    /opt/ripple/bin/rippled server_info | grep server_state
else
    echo "rippled NOT running - attempting restart"
    sudo systemctl restart rippled
    sleep 30
fi

# 2. Check state after restart
STATE=$(/opt/ripple/bin/rippled server_info 2>/dev/null | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)

if [ "$STATE" = "proposing" ]; then
    echo "RECOVERED: Validator is proposing"
elif [ "$STATE" = "full" ] || [ "$STATE" = "syncing" ]; then
    echo "PARTIAL: Validator is $STATE - monitor recovery"
else
    echo "CRITICAL: Unexpected state '$STATE' - manual intervention needed"
fi
```
Security Incident:

```bash
#!/bin/bash
# Security incident response

echo "=== SECURITY INCIDENT RESPONSE ==="
echo "Time: $(date)"

# 1. Stop validator to prevent further damage
echo "Stopping rippled..."
sudo systemctl stop rippled

# 2. Preserve evidence
echo "Preserving logs..."
INCIDENT_DIR="/tmp/incident_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$INCIDENT_DIR"
cp -r /var/log/rippled "$INCIDENT_DIR/"
cp /var/log/auth.log "$INCIDENT_DIR/"
cp /opt/ripple/etc/rippled.cfg "$INCIDENT_DIR/"

# 3. Check for unauthorized access
echo "Checking for unauthorized access..."
last | head -20 > "$INCIDENT_DIR/last.txt"
who > "$INCIDENT_DIR/who.txt"
ps aux > "$INCIDENT_DIR/ps.txt"
netstat -plant > "$INCIDENT_DIR/netstat.txt"

# 4. Instructions
echo ""
echo "Evidence preserved to: $INCIDENT_DIR"
echo ""
echo "NEXT STEPS:"
echo "1. Analyze preserved evidence"
echo "2. Determine scope of compromise"
echo "3. If token compromised: Generate new token"
echo "4. If server compromised: Rebuild from scratch"
echo "5. Document incident thoroughly"
```
```bash
# Search for errors in time range
sudo grep -E "$(date -d '1 hour ago' '+%Y-%m-%d %H')" /var/log/rippled/debug.log | grep -i error

# Find all unique error types
sudo grep -i error /var/log/rippled/debug.log | awk '{print $NF}' | sort | uniq -c | sort -rn

# Track state changes
sudo grep -iE "server_state|state change" /var/log/rippled/debug.log | tail -50

# Find connection issues
sudo grep -iE "disconnect|connection|timeout" /var/log/rippled/debug.log | tail -30

# Validation-specific messages
sudo grep -iE "validation|proposing|consensus" /var/log/rippled/debug.log | tail -30
```
Normal activity:
- "Validation: ..." - Normal validation activity
- "Peer connected" - New peer connections
- "Ledger ..." - Ledger processing

Watch closely:
- "Resource limit" - Approaching limits
- "Slow query" - Performance concern
- "Peer disconnected" - Normal but watch frequency

Serious warnings:
- "Database error" - Potential corruption
- "Out of memory" - Resource exhaustion
- "Failed to connect" - Network issues
- "Invalid signature" - Possible attack or bug
```bash
# Create timeline of events
sudo grep -E "$(date '+%Y-%m-%d')" /var/log/rippled/debug.log | \
    grep -iE "error|warning|state|validation|connect" | \
    head -100 > /tmp/timeline.txt
```

Look for patterns:
- What happened before the error?
- Did state change?
- Were there connection issues?
- What was the sequence of events?
```markdown
# Post-Incident Report

## Summary
- **Incident ID:** INC-YYYY-MM-DD-XXX
- **Severity:** P1/P2/P3/P4
- **Duration:** Start time to resolution time
- **Impact:** What was affected

## Timeline
| Time | Event |
|---|---|
| HH:MM | Issue detected |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Fix applied |
| HH:MM | Service restored |
| HH:MM | Incident closed |

## Root Cause
[Detailed explanation of what caused the incident]

## Resolution
[What was done to fix the issue]

## Impact Assessment
- Duration of outage: X minutes
- Validations missed: Approximately Y
- Reputation impact: [Assessment]

## Lessons Learned
- [What we learned]
- [What we learned]

## Prevention
[What will prevent recurrence]
```
Root Cause Analysis Using Five Whys:
Problem: Validator was down for 2 hours
Why 1: Why was the validator down?
→ rippled crashed due to out of memory
Why 2: Why did it run out of memory?
→ Memory usage grew beyond available RAM
Why 3: Why did memory usage grow?
→ node_size was set to "huge" but server only has 64GB
Why 4: Why was node_size set incorrectly?
→ Configuration was copied from documentation example
Why 5: Why wasn't this caught earlier?
→ No monitoring alert for memory usage trend
Root Cause: Missing memory trend monitoring
Action: Add memory usage trend alerting at 70% threshold
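That action item can be sketched as a cron-able check. The 70% threshold comes from the example above; how you deliver the alert (mail, webhook) is left open, so this version just prints:

```bash
#!/usr/bin/env bash
# Sketch: alert when memory usage crosses a threshold (70% per the
# action item above). A cron job would route the ALERT line to the
# operator; here it simply prints.
THRESHOLD="${THRESHOLD:-70}"

mem_used_pct() {
    # $1 = used kB, $2 = total kB -> integer percent
    echo $(( $1 * 100 / $2 ))
}

read -r total used <<< "$(free 2>/dev/null | awk '/^Mem:/ {print $2, $3}')" || true

if [ -n "$total" ] && [ "$total" -gt 0 ]; then
    pct=$(mem_used_pct "$used" "$total")
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "ALERT: memory at ${pct}% (threshold ${THRESHOLD}%)"
    else
        echo "OK: memory at ${pct}%"
    fi
fi
```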
Track Improvements Over Time:
- Mean Time To Detect (MTTD)
- Mean Time To Resolve (MTTR)
- Incident frequency by type
- Repeat incidents (same root cause)

Targets:
- MTTD < 5 minutes
- MTTR < 30 minutes for P2
- Zero repeat incidents for same cause
- Decreasing incident frequency

Review regularly:
- Are we improving?
- What patterns emerge?
- What investments would help?
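MTTR is simple to compute from the timestamps you already record in incident reports. A sketch using GNU `date`; the comma-separated detected/resolved log format is an assumption, not a standard:

```bash
#!/usr/bin/env bash
# Sketch: compute resolution minutes from two timestamps, e.g. the
# "Issue detected" and "Service restored" rows of an incident report.
# Assumes GNU date (-d) and "YYYY-MM-DD HH:MM" timestamps.

minutes_between() {
    local start_s end_s
    start_s=$(date -d "$1" +%s)
    end_s=$(date -d "$2" +%s)
    echo $(( (end_s - start_s) / 60 ))
}

# Example: averaging MTTR over a small hypothetical incident log
total=0; count=0
while IFS=, read -r detected resolved; do
    total=$(( total + $(minutes_between "$detected" "$resolved") ))
    count=$(( count + 1 ))
done <<'EOF'
2024-01-10 09:00,2024-01-10 09:42
2024-01-22 14:05,2024-01-22 14:20
EOF
echo "MTTR over $count incidents: $(( total / count )) minutes"
```

For the two sample incidents above (42 and 15 minutes), this prints an MTTR of 28 minutes.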
---
Escalate When:
- Issue beyond your expertise
- Resolution taking too long
- Impact is high or growing
- Need additional resources

Escalate to the community or developers when:
- Potential network-wide issue
- Need input from rippled developers
- Security vulnerability discovered
- Unusual behavior affecting others

Notify the broader community:
- If on UNLs and extended outage
- If you discover network-wide issue
- If security concern affects others

Document Your Escalation Path:

Level 1 (self-service):
- Use troubleshooting guides
- Check documentation
- Review logs

Level 2 (community and providers):
- XRPL Discord #validators channel
- Community operator contacts
- Hosting provider support

Level 3 (developers and security):
- rippled GitHub issues (for bugs)
- Direct developer contact (if established)
- Security disclosure channels (for vulnerabilities)

Keep the path current:
- Update contact information regularly
- Test communication channels
- Know response times
---
✅ Systematic troubleshooting is faster - Methodical approach beats random attempts
✅ Documentation accelerates resolution - Past incidents inform current troubleshooting
✅ Post-incident analysis prevents repeats - Root cause analysis leads to prevention
✅ Preparation reduces panic - Pre-written procedures execute better under pressure
⚠️ Every incident is unique - Past patterns help but don't guarantee solutions
⚠️ Root causes can be complex - Simple "five whys" may not capture full picture
⚠️ Prevention isn't always possible - Some incidents stem from external factors
📌 Untested procedures fail under pressure - Practice your incident response
📌 Incomplete documentation - Missing information extends resolution time
📌 No escalation path - Being stuck alone on a critical issue
📌 Ignoring post-incident analysis - Same problems recur without learning
You will have incidents. The measure of a professional operator isn't zero incidents—it's fast detection, calm response, effective resolution, and genuine learning.
Build your troubleshooting muscle in low-stakes situations (testnet, minor issues) so you're prepared for high-stakes moments. Document everything. Learn from every incident. Over time, you'll handle issues that would have panicked you before.
Assignment: Build and test your incident response capability.

Requirements:

1. Diagnostic toolkit:
- Create diagnostic script covering all common checks
- Document common problems and solutions
- Build decision tree for your environment
- Test diagnostic procedures

2. Emergency procedures:
- Document emergency response procedures
- Create quick-reference commands
- Define escalation path
- Test emergency procedures on testnet

3. Post-incident process:
- Create post-incident report template
- Document root cause analysis method
- Define improvement tracking process
- Create action item tracking system

4. Simulation:
- Conduct simulated incident on testnet
- Follow your procedures
- Complete post-incident report
- Identify procedure improvements

Deliverables:
- PDF or Markdown document
- Scripts and procedures
- Completed simulation report
- Updated procedures based on simulation

Grading:
- Comprehensive diagnostic toolkit (25%)
- Tested emergency procedures (25%)
- Complete post-incident process (25%)
- Realistic simulation with learning (25%)

Time investment: 4-6 hours
Value: Tested incident response capability
1. Troubleshooting Approach (Tests Methodology):
What should be your FIRST step when investigating a validator issue?
A) Restart the service
B) Gather information about what's happening
C) Ask for help on Discord
D) Check recent configuration changes
Correct Answer: B
Explanation: Before taking any action, gather information: What are the symptoms? When did it start? What's the current state? This information guides effective troubleshooting. Restarting without understanding may hide the problem temporarily or make diagnosis harder.
2. Common Problem (Tests Technical Knowledge):
Your validator shows server_state "full" instead of "proposing" after restart. What's the most likely cause?
A) Network connectivity issues
B) Validator token not loaded from configuration
C) Insufficient memory
D) Clock synchronization problems
Correct Answer: B
Explanation: "full" state indicates the server is synchronized but not validating. This usually means the validator token isn't configured correctly—either missing from config, malformed, or not loaded. Check for pubkey_validator in server_info and verify [validator_token] section in configuration.
3. Incident Response (Tests Process Knowledge):
During a P1 incident, what should you do after stopping the immediate damage?
A) Immediately begin post-incident report
B) Preserve evidence before making further changes
C) Restart all services
D) Notify the community
Correct Answer: B
Explanation: After containing the incident, preserve evidence (logs, state, configuration) before making changes that might destroy diagnostic information. This enables proper root cause analysis later. Notification and documentation come after preservation.
4. Post-Incident Analysis (Tests Understanding):
What is the primary goal of post-incident analysis?
A) Assign blame for the incident
B) Document the timeline for compliance
C) Prevent recurrence by understanding root cause
D) Calculate the financial impact
Correct Answer: C
Explanation: Post-incident analysis focuses on prevention, not blame. By understanding the true root cause (not just the immediate trigger), we can implement changes that prevent the same or similar incidents. This improves overall reliability over time.
5. Escalation (Tests Judgment):
When should you escalate to the broader validator community?
A) For any issue you can't solve in 5 minutes
B) When you discover an issue that may affect other validators or the network
C) When your validator has been down for more than 1 hour
D) Only when explicitly required by Ripple
Correct Answer: B
Explanation: Escalate to the community when the issue may not be isolated to your validator—potential network-wide issues, security vulnerabilities, or unusual behavior others should know about. Your individual downtime isn't a community concern unless you're on UNLs, but network-affecting issues are.
- Google SRE Book - Incident Management chapters
- PagerDuty Incident Response documentation
- Post-incident review best practices
- Linux system administration guides
- rippled GitHub issues (common problems)
- XRPL Discord troubleshooting discussions
- Five Whys methodology
- Fishbone diagrams
- Blameless post-mortems
For Next Lesson:
With troubleshooting capability established, Lesson 16 will cover domain verification and trust building—the steps toward being recognized as a legitimate, trustworthy validator operator.
End of Lesson 15
Total words: ~5,500
Estimated completion time: 60 minutes reading + 4-6 hours implementation and simulation
Key Takeaways

**Follow systematic methodology**—detect, assess, hypothesize, test, resolve, document; this structure prevents panic-driven random attempts.

**Know your diagnostic commands**—server_info, peers, logs, system resources; quick information gathering enables quick resolution.

**Classify incidents by severity**—P1 requires immediate response, P4 can wait; appropriate response based on impact.

**Document incidents thoroughly**—timeline, root cause, resolution, lessons learned; this knowledge prevents repeat incidents.

**Conduct post-incident analysis**—every incident is a learning opportunity; the goal is preventing recurrence, not assigning blame.

---