Routine Maintenance Procedures
Learning Objectives
- Establish a maintenance schedule covering all routine validator tasks
- Execute software updates safely with minimal downtime
- Manage logs and database growth to prevent disk exhaustion
- Apply security updates promptly while maintaining stability
- Document maintenance activities for operational history
The best incident is one that never happens.
What Maintenance Prevents:
- Log rotation prevents full disks
- Database cleanup maintains free space
- Monitoring raises alerts before problems become critical
- Timely updates patch vulnerabilities
- Review catches misconfigurations
- Audit reveals unauthorized changes
- Resource monitoring catches trends
- Cleanup prevents accumulation
- Tuning maintains efficiency
- Proactive updates vs. emergency patches
- Planned restarts vs. crashes
- Scheduled maintenance windows
Maintenance is scheduled, controlled change—far better than unscheduled, uncontrolled failures.
---
Maintenance Types:
Daily:
- Health check review
- Alert response
- Quick status verification
Weekly:
- Detailed health audit
- Log review
- Resource trend analysis
- Security update check
Monthly:
- Comprehensive system audit
- Performance analysis
- Backup verification
- Documentation update
- Capacity planning
Less frequent (e.g., quarterly):
- Security review
- Procedure testing
- Long-term trend analysis
As needed:
- Software updates (when released)
- Security patches (urgent)
- Incident response
- Configuration changes
Monthly Maintenance Calendar:
Week 1:
- Monday: Weekly health audit
- Ongoing: Daily monitoring
Week 2:
- Monday: Weekly health audit
- Wednesday: Security update review
- Ongoing: Daily monitoring
Week 3:
- Monday: Weekly health audit
- Friday: Monthly comprehensive audit
- Ongoing: Daily monitoring
Week 4:
- Monday: Weekly health audit
- Thursday: Backup verification
- Friday: Documentation update
- Ongoing: Daily monitoring
Scheduling Maintenance:
Good times:
- Low-traffic periods (varies by use case)
- Your awake/available hours
- Not during major network events
Times to avoid:
- Peak usage times
- When you can't monitor the aftermath
- Multiple changes at once
- Right before vacations/unavailability
For validators:
- The network doesn't have "low traffic" per se
- Choose times you can monitor closely
- Avoid times when major announcements are expected
- Have a rollback plan ready
---
rippled Update Types:
Major releases:
- Significant changes
- May include breaking changes
- Thorough testing required
- Longer observation period
Minor releases:
- New features, improvements
- Should be backward compatible
- Standard testing required
- Normal observation period
Patch releases:
- Bug fixes
- Security patches
- Minimal testing needed
- Can expedite if security-critical
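To decide how much testing an update deserves, it can help to classify the version change automatically. The sketch below assumes version strings of the form MAJOR.MINOR.PATCH (optionally followed by a packaging suffix such as `-1`); `classify_update` is a hypothetical helper name, not part of rippled.

```bash
#!/bin/bash
# Classify a version change as major, minor, patch, or none.
# Assumption: versions look like MAJOR.MINOR.PATCH; anything after the
# first "-" (for example a package revision) is ignored.
classify_update() {
    local old="${1%%-*}" new="${2%%-*}"
    local o_major o_minor o_patch n_major n_minor n_patch
    IFS=. read -r o_major o_minor o_patch <<< "$old"
    IFS=. read -r n_major n_minor n_patch <<< "$new"
    if [ "$n_major" != "$o_major" ]; then
        echo "major"
    elif [ "$n_minor" != "$o_minor" ]; then
        echo "minor"
    elif [ "$n_patch" != "$o_patch" ]; then
        echo "patch"
    else
        echo "none"
    fi
}
```

For example, `classify_update 2.0.0 2.1.0` prints `minor`, which would call for standard testing and a normal observation period.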
```bash
# Create update procedure script
sudo nano /opt/ripple/bin/update-rippled.sh
```

```bash
#!/bin/bash
#===============================================================================
# rippled Update Procedure
# Safe update with rollback capability
#===============================================================================

set -e  # Exit on error

echo "=============================================="
echo "rippled Update Procedure"
echo "Started: $(date)"
echo "=============================================="

# Pre-flight checks
echo ""
echo "=== Pre-Update Checks ==="

# Record current version
CURRENT_VERSION=$(/opt/ripple/bin/rippled --version | head -1)
echo "Current version: $CURRENT_VERSION"

# Check current state
STATE=$(/opt/ripple/bin/rippled server_info 2>/dev/null | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)
echo "Current state: $STATE"

if [ "$STATE" != "proposing" ]; then
    echo "WARNING: Not in proposing state. Continue? (y/n)"
    read -r response
    if [ "$response" != "y" ]; then
        exit 1
    fi
fi

# Backup configuration
echo ""
echo "=== Backing Up Configuration ==="
BACKUP_DIR="/opt/ripple/backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
cp /opt/ripple/etc/rippled.cfg "$BACKUP_DIR/"
cp /opt/ripple/etc/validators.txt "$BACKUP_DIR/" 2>/dev/null || true
echo "Backed up to: $BACKUP_DIR"

# Refresh package lists while rippled is still running (shortens downtime)
echo ""
echo "=== Refreshing Package Lists ==="
sudo apt update

# Stop service
echo ""
echo "=== Stopping rippled ==="
sudo systemctl stop rippled
sleep 5

# Update package
echo ""
echo "=== Updating Package ==="
sudo apt install --only-upgrade rippled -y

# Start service
echo ""
echo "=== Starting rippled ==="
sudo systemctl start rippled

# Wait for startup
echo ""
echo "=== Waiting for Synchronization (120 seconds) ==="
sleep 120

# Post-update verification
echo ""
echo "=== Post-Update Verification ==="

NEW_VERSION=$(/opt/ripple/bin/rippled --version | head -1)
echo "New version: $NEW_VERSION"

NEW_STATE=$(/opt/ripple/bin/rippled server_info 2>/dev/null | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)
echo "New state: $NEW_STATE"

PEERS=$(/opt/ripple/bin/rippled server_info 2>/dev/null | grep -o '"peers" : [0-9]*' | awk '{print $3}')
echo "Peers: $PEERS"

# Check success
if [ "$NEW_STATE" = "proposing" ] || [ "$NEW_STATE" = "full" ]; then
    echo ""
    echo "=== UPDATE SUCCESSFUL ==="
    echo "Old version: $CURRENT_VERSION"
    echo "New version: $NEW_VERSION"
    echo "State: $NEW_STATE"
else
    echo ""
    echo "=== WARNING: Not in expected state ==="
    echo "Current state: $NEW_STATE"
    echo "Consider rollback if state doesn't improve"
fi

echo ""
echo "=============================================="
echo "Update complete: $(date)"
echo "Monitor closely for next 24 hours"
echo "=============================================="
```
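A fixed 120-second wait may be too short after some restarts and needlessly long after others. One possible alternative is to poll the server state until it reaches `proposing` or `full`, giving up after a timeout. The sketch below is only an illustration: `wait_for_state` and `current_state` are hypothetical helper names, and the timeout and interval values are assumptions to tune for your environment.

```bash
#!/bin/bash
# Poll a state-reporting command until it prints an accepted state
# ("proposing" or "full") or the timeout expires.
# Usage: wait_for_state TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Prints the last observed state; returns 0 on success, 1 on timeout.
wait_for_state() {
    local timeout="$1" interval="$2"
    shift 2
    local elapsed=0 state=""
    while [ "$elapsed" -lt "$timeout" ]; do
        state=$("$@" 2>/dev/null) || state=""
        case "$state" in
            proposing|full)
                echo "$state"
                return 0
                ;;
        esac
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "${state:-unknown}"
    return 1
}

# Hypothetical helper: reads server_state with the same grep/cut
# pipeline used in the update script.
current_state() {
    /opt/ripple/bin/rippled server_info 2>/dev/null |
        grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4
}
```

In the update script, `sleep 120` could then be replaced with something like `wait_for_state 600 10 current_state || echo "Still not synced after 10 minutes"`.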
```bash
sudo chmod +x /opt/ripple/bin/update-rippled.sh
```

```bash
# Create rollback script
sudo nano /opt/ripple/bin/rollback-rippled.sh
```

```bash
#!/bin/bash
#===============================================================================
# rippled Rollback Procedure
# Revert to previous version if update fails
#===============================================================================

echo "=============================================="
echo "rippled Rollback Procedure"
echo "=============================================="

# Check available versions
echo ""
echo "=== Available Versions ==="
apt-cache policy rippled

echo ""
echo "Enter version to install (e.g., 2.0.0-1):"
read -r VERSION

if [ -z "$VERSION" ]; then
    echo "No version specified. Exiting."
    exit 1
fi

# Confirm
echo ""
echo "This will install rippled version: $VERSION"
echo "Continue? (y/n)"
read -r response
[ "$response" != "y" ] && exit 1

# Stop service
echo ""
echo "=== Stopping rippled ==="
sudo systemctl stop rippled

# Install specific version
# (--allow-downgrades is needed because -y alone refuses to downgrade)
echo ""
echo "=== Installing Version $VERSION ==="
sudo apt install --allow-downgrades -y rippled="$VERSION"

# Restore configuration if needed
echo ""
echo "=== Check if configuration restore needed ==="
ls -la /opt/ripple/backups/ | tail -5
echo "Restore from backup? (y/n)"
read -r restore

if [ "$restore" = "y" ]; then
    echo "Enter backup directory name:"
    read -r backup_dir
    BACKUP_CFG="/opt/ripple/backups/$backup_dir/rippled.cfg"
    if [ -f "$BACKUP_CFG" ]; then
        cp "$BACKUP_CFG" /opt/ripple/etc/
    else
        echo "Backup not found: $BACKUP_CFG (configuration left unchanged)"
    fi
fi

# Start service
echo ""
echo "=== Starting rippled ==="
sudo systemctl start rippled

echo ""
echo "=== Rollback Complete ==="
echo "Monitor server state and verify operation"
```
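A typo in the version prompt would only surface after rippled has already been stopped. As a possible safeguard, the version can be checked against the repository listing first. This sketch assumes the `package | version | source` column layout printed by `apt-cache madison`; `version_in_list` is a hypothetical helper name.

```bash
#!/bin/bash
# Succeed if the wanted version appears in `apt-cache madison`-style
# output supplied on stdin (columns separated by "|").
version_in_list() {
    local wanted="$1"
    awk -F'|' -v want="$wanted" '
        { gsub(/^[ \t]+|[ \t]+$/, "", $2) }   # trim the version column
        $2 == want { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

It could be called before the service is stopped, for example: `apt-cache madison rippled | version_in_list "$VERSION" || { echo "Version not available"; exit 1; }`.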
```bash
sudo chmod +x /opt/ripple/bin/rollback-rippled.sh
```

```bash
# Configure logrotate for rippled
sudo nano /etc/logrotate.d/rippled
```

```
/var/log/rippled/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 rippled rippled
    sharedscripts
    postrotate
        # Ask rippled to reopen its log file via the logrotate admin command.
        # (systemctl reload only works if the unit defines a reload action.)
        /opt/ripple/bin/rippled --conf /opt/ripple/etc/rippled.cfg logrotate > /dev/null 2>&1 || true
    endscript
}
```

```bash
# Create log cleanup script
sudo nano /opt/ripple/bin/cleanup-logs.sh
```

```bash
#!/bin/bash
#===============================================================================
# Log Cleanup Script
# Removes old logs and maintains disk space
#===============================================================================

LOG_DIR="/var/log/rippled"
METRICS_DIR="/var/log/validator-metrics"
RETENTION_DAYS=14

echo "Log Cleanup - $(date)"
echo "========================"

# rippled logs
echo "Cleaning rippled logs older than $RETENTION_DAYS days..."
find "$LOG_DIR" -name "*.log.*" -mtime +"$RETENTION_DAYS" -delete
find "$LOG_DIR" -name "*.gz" -mtime +"$RETENTION_DAYS" -delete

# Metrics files
echo "Cleaning metrics files older than $RETENTION_DAYS days..."
find "$METRICS_DIR" -name "*.json" -mtime +"$RETENTION_DAYS" -delete

# Health check logs
echo "Cleaning health check logs..."
find /var/log -name "validator-health*.log" -mtime +"$RETENTION_DAYS" -delete

# Journal cleanup
echo "Cleaning systemd journal..."
sudo journalctl --vacuum-time="${RETENTION_DAYS}d"

# Report disk usage
echo ""
echo "Current disk usage:"
df -h /var/log
df -h /var/lib/rippled

echo ""
echo "Cleanup complete"
```

```bash
sudo chmod +x /opt/ripple/bin/cleanup-logs.sh

# Schedule weekly cleanup
# Add to crontab:
0 3 * * 0 /opt/ripple/bin/cleanup-logs.sh >> /var/log/cleanup.log 2>&1
```
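Because `find ... -delete` is irreversible, it may be worth previewing what a cleanup run would remove before scheduling it. The following sketch wraps the same `find` test in a function that defaults to a dry run; `prune_old_files` and the `DRY_RUN` variable are hypothetical names introduced here for illustration.

```bash
#!/bin/bash
# List (or delete) regular files under DIR matching PATTERN that are
# older than DAYS days. By default nothing is deleted; set DRY_RUN=0
# to delete the files that are listed.
prune_old_files() {
    local dir="$1" pattern="$2" days="$3"
    if [ ! -d "$dir" ]; then
        echo "Skipping missing directory: $dir" >&2
        return 0
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        find "$dir" -type f -name "$pattern" -mtime +"$days" -print
    else
        find "$dir" -type f -name "$pattern" -mtime +"$days" -print -delete
    fi
}
```

Running `prune_old_files /var/log/rippled '*.gz' 14` prints the candidates only; `DRY_RUN=0 prune_old_files /var/log/rippled '*.gz' 14` prints and removes them.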
```bash
# Create log analysis script
sudo nano /opt/ripple/bin/analyze-logs.sh
```

```bash
#!/bin/bash
#===============================================================================
# Log Analysis Script
# Weekly review of rippled logs
#===============================================================================

LOG_FILE="/var/log/rippled/debug.log"

# Regex matching the dates of the last 7 days.
# Assumes rippled timestamps start like 2024-Jan-15; adjust if yours differ.
DATE_PATTERN=$(for i in 0 1 2 3 4 5 6; do
    LC_ALL=C date -d "$i days ago" '+%Y-%b-%d'
done | paste -sd'|' -)

echo "=============================================="
echo "Log Analysis Report - $(date)"
echo "=============================================="

# Lines are matched on the words "error"/"warning" and on the severity
# tags rippled appends to the log partition name (such as :ERR and :WRN).
echo ""
echo "=== Error Summary (Last 7 Days) ==="
sudo grep -iE "error|:ERR" "$LOG_FILE" | grep -E "^($DATE_PATTERN)" |
    awk '{print $NF}' | sort | uniq -c | sort -rn | head -10

echo ""
echo "=== Warning Summary (Last 7 Days) ==="
sudo grep -iE "warning|:WRN" "$LOG_FILE" | grep -E "^($DATE_PATTERN)" |
    awk '{print $NF}' | sort | uniq -c | sort -rn | head -10

echo ""
echo "=== Connection Issues ==="
# -E is required so that "|" is treated as alternation
sudo grep -iE "disconnect|connection" "$LOG_FILE" | tail -20

echo ""
echo "=== Validation Messages (Sample) ==="
sudo grep -i "validation" "$LOG_FILE" | tail -10

echo ""
echo "=== Log File Sizes ==="
ls -lh /var/log/rippled/

echo ""
echo "=============================================="
```

```bash
sudo chmod +x /opt/ripple/bin/analyze-logs.sh
```

Database Growth Management:
online_delete setting:
- Controls ledger history retention
- Higher values = more history = more disk
- Lower values = less history = less disk
Approximate values (ledgers close roughly every 3-5 seconds):
- 256: ~15-20 minutes of history, minimal disk
- 512: ~30 minutes, small footprint
- 2048: ~2 hours, moderate disk
- 32768: ~1-2 days, significant disk
For validators:
- 512-2048 typically sufficient
- Full history not required for validation
- Balance history needs with disk space
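The history durations above follow from simple arithmetic: the number of retained ledgers multiplied by the ledger close interval. The sketch below assumes an average close time of 4 seconds unless told otherwise (ledgers typically close every 3 to 5 seconds); `history_minutes` is a hypothetical helper name.

```bash
#!/bin/bash
# Estimate how many minutes of ledger history an online_delete value keeps.
# Usage: history_minutes LEDGERS [SECONDS_PER_LEDGER]
# Assumption: 4 seconds per ledger when no second argument is given.
history_minutes() {
    local ledgers="$1" secs_per_ledger="${2:-4}"
    echo $(( ledgers * secs_per_ledger / 60 ))
}
```

For instance, `history_minutes 2048` prints `136` (a little over two hours), and `history_minutes 32768` prints `2184` (about a day and a half at the assumed close time).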
```bash
# Create database monitoring script
sudo nano /opt/ripple/bin/monitor-database.sh
```

```bash
#!/bin/bash
#===============================================================================
# Database Space Monitoring
#===============================================================================

DB_PATH="/var/lib/rippled/db"

echo "Database Space Report - $(date)"
echo "================================"

# Total database size
echo ""
echo "=== Database Size ==="
du -sh "$DB_PATH"

# Breakdown by directory
echo ""
echo "=== Directory Breakdown ==="
du -sh "$DB_PATH"/*

# Disk space available
echo ""
echo "=== Disk Space ==="
df -h "$DB_PATH"

# Growth rate (compare to yesterday; assumes the script runs once per day)
TODAY_SIZE=$(du -sb "$DB_PATH" | awk '{print $1}')
# Note: /var/run is cleared at reboot, so the first run afterwards has no baseline
YESTERDAY_FILE="/var/run/db_size_yesterday"
GROWTH=0  # Stays 0 when there is no previous measurement

if [ -f "$YESTERDAY_FILE" ]; then
    YESTERDAY_SIZE=$(cat "$YESTERDAY_FILE")
    GROWTH=$((TODAY_SIZE - YESTERDAY_SIZE))
    GROWTH_MB=$((GROWTH / 1024 / 1024))
    echo ""
    echo "=== Growth Since Yesterday ==="
    echo "Growth: ${GROWTH_MB} MB"
fi

echo "$TODAY_SIZE" > "$YESTERDAY_FILE"

# Projection (df -Pk reports free space in 1 KB blocks on a single line)
DISK_FREE=$(df -Pk "$DB_PATH" | tail -1 | awk '{print $4}')
if [ "$GROWTH" -gt 0 ]; then
    DAYS_REMAINING=$((DISK_FREE * 1024 / GROWTH))
    echo ""
    echo "=== Projection ==="
    echo "At current growth rate: ~$DAYS_REMAINING days until disk full"
fi
```
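The projection step divides free space by daily growth, which only makes sense when growth is positive. A small function can isolate that arithmetic and its guard so it can be checked on its own. This is a sketch under the same assumptions as the script (free space in 1 KB blocks, growth in bytes per day); `days_until_full` is a hypothetical helper name.

```bash
#!/bin/bash
# Estimate days until the disk is full.
# Usage: days_until_full FREE_KB GROWTH_BYTES_PER_DAY
# Prints "n/a" when growth is zero or negative (no meaningful projection).
days_until_full() {
    local free_kb="$1" growth_bytes="$2"
    if [ "$growth_bytes" -le 0 ]; then
        echo "n/a"
        return 0
    fi
    echo $(( free_kb * 1024 / growth_bytes ))
}
```

With 1 GiB free (`1048576` KB) and 100 MiB of daily growth (`104857600` bytes), `days_until_full 1048576 104857600` prints `10`.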
```bash
sudo chmod +x /opt/ripple/bin/monitor-database.sh
```

```bash
# If database needs cleanup (changing online_delete)
# This requires careful procedure:

# 1. Update configuration
sudo nano /opt/ripple/etc/rippled.cfg
# Change online_delete to lower value

# 2. Restart rippled
sudo systemctl restart rippled

# 3. Monitor deletion progress
# Deletion happens gradually, not immediately

# WARNING: Don't set online_delete too low
# Very low values can cause issues during network stress
```
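Before and after editing the configuration, it can be useful to confirm which `online_delete` value is actually set. The sketch below assumes the setting lives in the `[node_db]` stanza as an `online_delete=N` line; `get_online_delete` is a hypothetical helper name.

```bash
#!/bin/bash
# Print the online_delete value from the [node_db] stanza of a
# rippled.cfg-style file. Prints nothing if the setting is absent
# or commented out.
get_online_delete() {
    local cfg="$1"
    awk '
        /^\[/ { in_node_db = ($0 ~ /^\[node_db\]/) }   # track current stanza
        in_node_db && /^[ \t]*online_delete[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")                    # keep text after "="
            sub(/[ \t\r]+$/, "")                        # trim trailing space
            print
            exit
        }
    ' "$cfg"
}
```

For example, `get_online_delete /opt/ripple/etc/rippled.cfg` would print the configured number of ledgers, which can be recorded in the maintenance log alongside the change.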
```bash
# Create security update script
sudo nano /opt/ripple/bin/security-updates.sh
```

```bash
#!/bin/bash
#===============================================================================
# Security Update Script
# Apply security updates with minimal risk
#===============================================================================

echo "Security Update Check - $(date)"
echo "=================================="

# Refresh package lists first so the check below is not based on stale data
sudo apt update

# Check for security updates
echo ""
echo "=== Available Security Updates ==="
apt list --upgradable 2>/dev/null | grep -i security

# Count updates
SECURITY_UPDATES=$(apt list --upgradable 2>/dev/null | grep -ci security)

if [ "$SECURITY_UPDATES" -gt 0 ]; then
    echo ""
    echo "Found $SECURITY_UPDATES security updates"
    echo ""
    echo "Apply security updates? (y/n)"
    read -r response

    if [ "$response" = "y" ]; then
        echo ""
        echo "=== Applying Security Updates ==="
        # NOTE: apt upgrade installs every pending upgrade, not only security ones
        sudo apt upgrade -y

        echo ""
        echo "=== Checking if Reboot Required ==="
        if [ -f /var/run/reboot-required ]; then
            echo "REBOOT REQUIRED"
            echo "Schedule reboot at convenient time"
        else
            echo "No reboot required"
        fi
    fi
else
    echo ""
    echo "No security updates available"
fi

echo ""
echo "Update check complete"
```
```bash
sudo chmod +x /opt/ripple/bin/security-updates.sh
```

```bash
# Configure automatic security updates
sudo apt install unattended-upgrades -y
sudo dpkg-reconfigure -plow unattended-upgrades

# Verify configuration
cat /etc/apt/apt.conf.d/50unattended-upgrades
```
Monthly Security Tasks:
Week 1:
□ Review failed login attempts
□ Check for unauthorized users
□ Verify SSH configuration
□ Review firewall rules
Week 2:
□ Run security scanner (Lynis)
□ Review AIDE integrity report
□ Check certificate expiration
□ Audit cron jobs
Week 3:
□ Review security updates status
□ Check for new vulnerabilities
□ Verify backup encryption
□ Test incident response
Week 4:
□ Update security documentation
□ Review access permissions
□ Check monitoring coverage
□ Plan next month's tasks
# Maintenance Log
- Date: YYYY-MM-DD
- Time: HH:MM
- Operator: Name
- Type: [Routine/Update/Emergency/Security]
- Duration: X minutes
- Description: What was done
- Outcome: Result
- Notes: Additional observations
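Filling in the template by hand is easy to skip, so one option is a small helper that appends an entry with the date and time filled in automatically. The sketch below mirrors the template fields above; `log_maintenance` and the log file location are hypothetical and can be adapted freely.

```bash
#!/bin/bash
# Append one entry to a maintenance log using the template fields.
# Usage: log_maintenance LOGFILE OPERATOR TYPE DURATION_MIN DESCRIPTION OUTCOME [NOTES]
log_maintenance() {
    local logfile="$1" operator="$2" type="$3" duration="$4"
    local description="$5" outcome="$6" notes="${7:-none}"
    {
        echo "- Date: $(date +%Y-%m-%d)"
        echo "- Time: $(date +%H:%M)"
        echo "- Operator: $operator"
        echo "- Type: $type"
        echo "- Duration: $duration minutes"
        echo "- Description: $description"
        echo "- Outcome: $outcome"
        echo "- Notes: $notes"
        echo ""
    } >> "$logfile"
}
```

A call such as `log_maintenance ~/maintenance-log.md "alice" "Routine" 15 "Weekly health audit" "No issues"` adds a complete entry to the end of the file.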
After Each Maintenance:
1. Review Procedure
2. Update Documentation
3. Capture Lessons
Track Over Time:
- Time spent on maintenance (hours/month)
- Planned vs. unplanned maintenance ratio
- Update success rate
- Mean time between incidents
- Increasing maintenance time → investigate
- More unplanned maintenance → prevention needed
- Update failures → process improvement
- Frequent incidents → root cause analysis
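If the maintenance log uses `- Type: ...` lines as in the template, the planned versus unplanned ratio can be approximated by counting entries per type. This is a rough sketch that assumes one single-word type per entry; `count_by_type` is a hypothetical helper name.

```bash
#!/bin/bash
# Count maintenance log entries per type, reading "- Type: X" lines.
# Output: one "TYPE: COUNT" line per type, sorted by type name.
count_by_type() {
    local logfile="$1"
    grep -E '^- Type: ' "$logfile" |
        sed 's/^- Type: //' |
        sort | uniq -c |
        awk '{ print $2 ": " $1 }'
}
```

Comparing the `Emergency` count with the other types over several months gives a simple view of whether unplanned work is growing.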
---
Daily Maintenance (5-10 minutes):
□ Review monitoring dashboard
□ Check for alerts
□ Verify server state = "proposing"
□ Quick resource check (disk, memory)
□ Review any automated reports
If Issues Found:
□ Document in maintenance log
□ Investigate and resolve
□ Update monitoring if needed
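Part of the daily checklist can be scripted so that the state and disk checks take seconds. The sketch below reuses the `server_info` parsing shown earlier in this lesson; the function names and the 80% disk threshold are assumptions chosen for illustration, not recommended values.

```bash
#!/bin/bash
# Read `rippled server_info` output on stdin and print the server_state value.
extract_state() {
    grep -o '"server_state" : "[^"]*"' | head -1 | cut -d'"' -f4
}

# Print the used-space percentage (digits only) of the filesystem
# that holds the given path.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Print a one-line summary; succeed only if the server is proposing
# and disk usage is below the assumed 80% threshold.
daily_check() {
    local state pct
    state=$(/opt/ripple/bin/rippled server_info 2>/dev/null | extract_state)
    pct=$(disk_used_pct /var/lib/rippled)
    echo "State: ${state:-unknown}, disk used: ${pct}%"
    [ "$state" = "proposing" ] && [ "$pct" -lt 80 ]
}
```

Running `daily_check || echo "Needs attention"` gives a quick pass/fail signal; anything that fails still belongs in the maintenance log.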
Weekly Maintenance (30-60 minutes):
□ Detailed health audit
□ Log analysis (errors, warnings)
□ Resource trend review
□ Security update check
□ Peer connectivity analysis
□ Backup verification spot-check
□ Documentation review
Deliverables:
□ Weekly status summary
□ Any issues documented
□ Next week's planned maintenance
Monthly Maintenance (2-4 hours):
□ Comprehensive system audit
□ Full security review
□ Backup restoration test
□ Performance analysis
□ Capacity planning review
□ Documentation update
□ Procedure verification
□ Incident review (if any)
Deliverables:
□ Monthly maintenance report
□ Updated documentation
□ Next month's maintenance plan
□ Any improvement recommendations
✅ Scheduled maintenance prevents emergencies - Regular updates and cleanup prevent accumulation of issues
✅ Documentation enables consistency - Written procedures ensure maintenance is done correctly regardless of who performs it
✅ Log management prevents disk exhaustion - Without rotation and cleanup, logs fill disks
✅ Security updates are essential - Timely patching prevents exploitation of known vulnerabilities
⚠️ Optimal maintenance frequency - Balance between thoroughness and time investment varies by situation
⚠️ Best update timing - Depends on your monitoring capability and availability
⚠️ Automation extent - Some maintenance benefits from human review; full automation may miss issues
📌 Skipping maintenance - Deferred maintenance accumulates until something breaks
📌 Updates without testing - Direct mainnet updates risk outages
📌 No rollback plan - Updates without rollback capability are high-risk
📌 Undocumented changes - Changes without documentation become troubleshooting obstacles
Maintenance isn't exciting, but it's essential. A validator that receives consistent, documented maintenance will outlast and outperform one that's only touched when something breaks.
Start with the basics: daily quick checks, weekly reviews, monthly audits. As you build comfort, refine your procedures. The goal is sustainable operation—maintenance that's routine, not heroic.
Assignment: Establish a comprehensive maintenance framework for your validator.
Requirements:
Maintenance schedule:
- Create daily, weekly, monthly checklists
- Define maintenance windows
- Schedule recurring tasks
- Document responsible parties
Update procedures:
- Document update procedure (script or steps)
- Create rollback procedure
- Define testing requirements
- Document notification process
Log management:
- Configure log rotation
- Create cleanup scripts
- Document retention policies
- Verify disk space monitoring
Maintenance documentation:
- Create log template
- Document recent maintenance activities
- Track metrics (time, outcomes)
- Plan upcoming maintenance
Submission format:
- PDF or Markdown document
- Scripts and configurations
- Completed checklists
- Sample log entries
Grading criteria:
- Comprehensive maintenance schedule (25%)
- Working update procedures (25%)
- Proper log management (25%)
- Documented maintenance activities (25%)
Time investment: 4-6 hours
Value: Sustainable maintenance framework for long-term operation
1. Update Procedure (Tests Process Knowledge):
What should you do BEFORE applying a rippled update to your mainnet validator?
A) Nothing—just apply the update
B) Notify other validators
C) Test the update on testnet and observe for 24-48 hours
D) Back up the database
Correct Answer: C
Explanation: Updates should always be tested on testnet first, with observation for at least 24-48 hours to identify any issues. Only after successful testnet operation should you apply to mainnet. This prevents applying problematic updates to production.
2. Log Management (Tests Technical Knowledge):
Why is log rotation important for validator operation?
A) It makes logs easier to read
B) It prevents disk exhaustion which would cause validator failure
C) It's required by XRPL protocol
D) It improves validation speed
Correct Answer: B
Explanation: Without log rotation, logs grow indefinitely until they fill the disk. A full disk causes rippled to fail, taking down your validator. Log rotation limits log size and removes old logs, preventing disk exhaustion.
3. Maintenance Frequency (Tests Operational Understanding):
What is an appropriate frequency for a comprehensive system audit?
A) Daily
B) Weekly
C) Monthly
D) Annually
Correct Answer: C
Explanation: Monthly comprehensive audits provide thorough review without being excessive. Daily is too frequent for deep audits, weekly is appropriate for regular health checks, and annually is too infrequent to catch developing issues.
4. Rollback Capability (Tests Risk Management):
Why should you maintain rollback capability for updates?
A) To save disk space
B) To revert to a previous version if the update causes problems
C) To comply with regulations
D) To improve performance
Correct Answer: B
Explanation: Rollback capability allows you to quickly revert to a known-good state if an update causes issues. Without rollback, a problematic update requires troubleshooting under pressure. With rollback, you can restore service quickly and troubleshoot at leisure.
5. Documentation Value (Tests Process Understanding):
What is the primary benefit of maintaining a maintenance log?
A) Regulatory compliance
B) Enables trend analysis, troubleshooting, and knowledge transfer
C) Reduces maintenance time
D) Automates maintenance tasks
Correct Answer: B
Explanation: Maintenance logs document what was done, when, and with what outcome. This enables trend analysis (is maintenance time increasing?), troubleshooting (what changed before this problem?), and knowledge transfer (new operators can understand history).
For Next Lesson:
With maintenance procedures established, Lesson 15 will cover troubleshooting and incident response—how to diagnose and resolve issues when they occur.
End of Lesson 14
Estimated completion time: 50 minutes reading + 4-6 hours implementation
Key Takeaways
- Scheduled maintenance prevents unscheduled outages: regular updates, cleanup, and review prevent accumulation of issues that cause incidents.
- Test updates on testnet first: never apply updates directly to mainnet; verify on testnet and observe before production deployment.
- Maintain rollback capability: every update should have a documented rollback procedure in case of problems.
- Log rotation prevents disk exhaustion: configure automatic log rotation and periodic cleanup to prevent storage issues.
- Document all maintenance activities: maintenance logs enable trend analysis, troubleshooting, and knowledge transfer.
---