Monitoring and Alerting Systems | Running an XRPL Validator | XRP Academy
Advanced · 65 min

Monitoring and Alerting Systems

Learning Objectives

Design a comprehensive monitoring strategy covering all critical validator metrics

Implement automated health checks with appropriate intervals and thresholds

Configure alerting systems that notify you of issues without causing alert fatigue

Integrate with monitoring platforms (Prometheus, Grafana, or alternatives)

Create operational dashboards for at-a-glance status visibility

Every minute your validator operates incorrectly without your knowledge is potential reputation damage.

Without monitoring:

  • Issues discovered by community reports
  • Missed validations accumulate unnoticed
  • State drops go undetected
  • Resource exhaustion surprises you
  • Reputation damage before you know there's a problem

With monitoring:

  • Issues discovered within minutes
  • Alerts prompt immediate investigation
  • Trends visible before they become problems
  • Planned maintenance instead of emergencies
  • Reputation protected by rapid response

Investment in monitoring pays dividends in avoided incidents.


Comprehensive Monitoring Covers:

Infrastructure:
- Server availability (is it reachable?)
- Resource usage (CPU, memory, disk, network)
- Hardware health (if applicable)

Service:
- Service status (is rippled running?)
- Process health (is it responsive?)
- Log errors (are there warnings?)

Application:
- Server state (is it proposing?)
- Synchronization (is ledger age low?)
- Peer connectivity (enough peers?)
- Validation activity (sending validations?)

Network:
- External reachability (can peers connect?)
- Validation visibility (are validations seen?)
- Agreement percentage (matching consensus?)
Alert immediately (critical):
  • Server down
  • State not "proposing"
  • Ledger age > 30 seconds
  • Peer count < 5
  • Disk > 95%

Alert soon (warning):
  • State transitions
  • Ledger age > 10 seconds
  • Peer count < 10
  • Memory > 80%
  • Disk > 80%

Track only (informational):
  • Resource trends
  • Peer changes
  • Version information
  • Performance metrics
Check Frequency Guidelines:

Every minute:
- Server state
- Process running
- Basic health

Every 5 minutes:
- Ledger synchronization
- Peer count
- Resource usage

Every 15 minutes:
- Detailed health check
- Log analysis
- Performance metrics

Daily:
- Trend analysis
- Capacity projections
- Security checks

Weekly:
- Comprehensive audit
- Report generation
- Backup verification
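These tiers map naturally onto cron schedules. An illustrative crontab sketch (only `comprehensive-health-check.sh` and `collect-metrics.sh` appear in this lesson; the other script names are hypothetical placeholders for whatever you implement at each tier):

```bash
# Illustrative cron schedule for the tiers above; names other than
# comprehensive-health-check.sh and collect-metrics.sh are placeholders.
* * * * *    /opt/ripple/bin/basic-state-check.sh           # every minute
*/5 * * * *  /opt/ripple/bin/collect-metrics.sh             # every 5 minutes
*/15 * * * * /opt/ripple/bin/comprehensive-health-check.sh  # every 15 minutes
0 3 * * *    /opt/ripple/bin/daily-trend-report.sh          # daily at 03:00
0 4 * * 0    /opt/ripple/bin/weekly-audit.sh                # weekly, Sunday 04:00
```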

---

```bash
# Create comprehensive health check
sudo nano /opt/ripple/bin/comprehensive-health-check.sh
```

```bash
#!/bin/bash
#===============================================================================
# Comprehensive Validator Health Check
# Exit codes: 0=healthy, 1=warning, 2=critical
#===============================================================================

# Configuration

ALERT_EMAIL="you@example.com"  # placeholder: set your address
CRITICAL_PEER_COUNT=5
WARNING_PEER_COUNT=10
CRITICAL_LEDGER_AGE=30
WARNING_LEDGER_AGE=10
CRITICAL_DISK_PERCENT=95
WARNING_DISK_PERCENT=80
CRITICAL_MEMORY_PERCENT=90
WARNING_MEMORY_PERCENT=80

# Initialize status

OVERALL_STATUS=0
MESSAGES=""

# Helper function

add_message() {
local level="$1"
local msg="$2"
MESSAGES="${MESSAGES}[$level] $msg\n"

case $level in
CRITICAL) [ $OVERALL_STATUS -lt 2 ] && OVERALL_STATUS=2 ;;
WARNING) [ $OVERALL_STATUS -lt 1 ] && OVERALL_STATUS=1 ;;
esac
}

#-------------------------------------------------------------------------------

# Check 1: rippled Process

#-------------------------------------------------------------------------------
if ! pgrep -x "rippled" > /dev/null; then
add_message "CRITICAL" "rippled process not running"
else
add_message "OK" "rippled process running"
fi

#-------------------------------------------------------------------------------

# Check 2: Server State

#-------------------------------------------------------------------------------
SERVER_INFO=$(/opt/ripple/bin/rippled server_info 2>/dev/null)
if [ $? -ne 0 ]; then
add_message "CRITICAL" "Cannot connect to rippled"
else
STATE=$(echo "$SERVER_INFO" | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)

if [ "$STATE" = "proposing" ]; then
add_message "OK" "Server state: proposing"
elif [ "$STATE" = "full" ]; then
add_message "WARNING" "Server state: full (not validating)"
else
add_message "CRITICAL" "Server state: $STATE"
fi
fi

#-------------------------------------------------------------------------------

# Check 3: Ledger Age

#-------------------------------------------------------------------------------
if [ -n "$SERVER_INFO" ]; then
LEDGER_AGE=$(echo "$SERVER_INFO" | grep -o '"age" : [0-9]*' | head -1 | awk '{print $3}')

if [ -n "$LEDGER_AGE" ]; then
if [ "$LEDGER_AGE" -ge "$CRITICAL_LEDGER_AGE" ]; then
add_message "CRITICAL" "Ledger age: ${LEDGER_AGE}s (threshold: ${CRITICAL_LEDGER_AGE}s)"
elif [ "$LEDGER_AGE" -ge "$WARNING_LEDGER_AGE" ]; then
add_message "WARNING" "Ledger age: ${LEDGER_AGE}s (threshold: ${WARNING_LEDGER_AGE}s)"
else
add_message "OK" "Ledger age: ${LEDGER_AGE}s"
fi
fi
fi

#-------------------------------------------------------------------------------

# Check 4: Peer Count

#-------------------------------------------------------------------------------
if [ -n "$SERVER_INFO" ]; then
PEERS=$(echo "$SERVER_INFO" | grep -o '"peers" : [0-9]*' | awk '{print $3}')

if [ -n "$PEERS" ]; then
if [ "$PEERS" -lt "$CRITICAL_PEER_COUNT" ]; then
add_message "CRITICAL" "Peer count: $PEERS (threshold: ${CRITICAL_PEER_COUNT})"
elif [ "$PEERS" -lt "$WARNING_PEER_COUNT" ]; then
add_message "WARNING" "Peer count: $PEERS (threshold: ${WARNING_PEER_COUNT})"
else
add_message "OK" "Peer count: $PEERS"
fi
fi
fi

#-------------------------------------------------------------------------------

# Check 5: Disk Usage

#-------------------------------------------------------------------------------
DISK_PERCENT=$(df /var/lib/rippled/db | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_PERCENT" -ge "$CRITICAL_DISK_PERCENT" ]; then
add_message "CRITICAL" "Disk usage: ${DISK_PERCENT}%"
elif [ "$DISK_PERCENT" -ge "$WARNING_DISK_PERCENT" ]; then
add_message "WARNING" "Disk usage: ${DISK_PERCENT}%"
else
add_message "OK" "Disk usage: ${DISK_PERCENT}%"
fi

#-------------------------------------------------------------------------------

# Check 6: Memory Usage

#-------------------------------------------------------------------------------
MEM_PERCENT=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100}')
if [ "$MEM_PERCENT" -ge "$CRITICAL_MEMORY_PERCENT" ]; then
add_message "CRITICAL" "Memory usage: ${MEM_PERCENT}%"
elif [ "$MEM_PERCENT" -ge "$WARNING_MEMORY_PERCENT" ]; then
add_message "WARNING" "Memory usage: ${MEM_PERCENT}%"
else
add_message "OK" "Memory usage: ${MEM_PERCENT}%"
fi

#-------------------------------------------------------------------------------

# Check 7: Recent Errors in Logs

#-------------------------------------------------------------------------------
# Count today's error lines in the log (case-insensitive)
RECENT_ERRORS=$(sudo grep -i "error" /var/log/rippled/debug.log 2>/dev/null | grep -c "$(date '+%Y-%m-%d')")
RECENT_ERRORS=${RECENT_ERRORS:-0}

if [ "$RECENT_ERRORS" -gt 10 ]; then
add_message "WARNING" "Recent log errors: $RECENT_ERRORS today"
else
add_message "OK" "Recent log errors: $RECENT_ERRORS today"
fi

#-------------------------------------------------------------------------------

# Output Results

#-------------------------------------------------------------------------------
echo "=============================================="
echo "Health Check - $(date)"
echo "=============================================="
echo -e "$MESSAGES"
echo "=============================================="
echo "Overall Status: $([ $OVERALL_STATUS -eq 0 ] && echo 'HEALTHY' || ([ $OVERALL_STATUS -eq 1 ] && echo 'WARNING' || echo 'CRITICAL'))"
echo "=============================================="

# Send alert if not healthy

if [ $OVERALL_STATUS -gt 0 ]; then
SUBJECT="[VALIDATOR] $([ $OVERALL_STATUS -eq 1 ] && echo 'WARNING' || echo 'CRITICAL') - $(hostname)"
echo -e "$MESSAGES" | mail -s "$SUBJECT" "$ALERT_EMAIL" 2>/dev/null
fi

exit $OVERALL_STATUS
```

```bash
# Make executable
sudo chmod +x /opt/ripple/bin/comprehensive-health-check.sh

# Test

sudo /opt/ripple/bin/comprehensive-health-check.sh
echo "Exit code: $?"
```

```bash
# Add to crontab
sudo crontab -e

# Health check every minute
* * * * * /opt/ripple/bin/comprehensive-health-check.sh >> /var/log/validator-health.log 2>&1

# Detailed check every 15 minutes (with longer output)

*/15 * * * * /opt/ripple/bin/comprehensive-health-check.sh -v >> /var/log/validator-health-detail.log 2>&1
```

```bash
# Create metrics collection for trending
sudo nano /opt/ripple/bin/collect-metrics.sh
```

```bash
#!/bin/bash
#===============================================================================
# Metrics Collection Script
# Collects metrics for trend analysis
#===============================================================================

METRICS_DIR="/var/log/validator-metrics"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
METRICS_FILE="$METRICS_DIR/metrics_$TIMESTAMP.json"

mkdir -p "$METRICS_DIR"

# Collect metrics

SERVER_INFO=$(/opt/ripple/bin/rippled server_info 2>/dev/null)

if [ $? -eq 0 ]; then
# Extract metrics
STATE=$(echo "$SERVER_INFO" | grep -o '"server_state" : "[^"]*"' | cut -d'"' -f4)
PEERS=$(echo "$SERVER_INFO" | grep -o '"peers" : [0-9]*' | awk '{print $3}')
LEDGER_AGE=$(echo "$SERVER_INFO" | grep -o '"age" : [0-9]*' | head -1 | awk '{print $3}')
IO_LATENCY=$(echo "$SERVER_INFO" | grep -o '"io_latency_ms" : [0-9]*' | awk '{print $3}')
UPTIME=$(echo "$SERVER_INFO" | grep -o '"uptime" : [0-9]*' | awk '{print $3}')

# System metrics

CPU_PERCENT=$(top -bn1 | grep rippled | awk '{print $9}' | head -1)
MEM_PERCENT=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100}')
DISK_PERCENT=$(df /var/lib/rippled/db | tail -1 | awk '{print $5}' | sed 's/%//')

# Create JSON

cat > "$METRICS_FILE" << EOF
{
"timestamp": "$(date -Iseconds)",
"server_state": "$STATE",
"peers": $PEERS,
"ledger_age": $LEDGER_AGE,
"io_latency_ms": ${IO_LATENCY:-0},
"uptime_seconds": $UPTIME,
"cpu_percent": ${CPU_PERCENT:-0},
"memory_percent": $MEM_PERCENT,
"disk_percent": $DISK_PERCENT
}
EOF
fi

# Cleanup old metrics (keep 7 days)

find "$METRICS_DIR" -name "metrics_*.json" -mtime +7 -delete
```

```bash
chmod +x /opt/ripple/bin/collect-metrics.sh

# Schedule collection every 5 minutes (add to crontab):
*/5 * * * * /opt/ripple/bin/collect-metrics.sh
```


Alert Severity Levels:

Critical (page immediately):
- Validator down
- State not "proposing" for 5+ minutes
- Disk > 95%
- Process crashed

High (notify within minutes):
- State dropped from "proposing"
- Ledger age > 30 seconds
- Peer count < 5
- Memory > 90%

Medium (notify within the hour):
- Ledger age > 10 seconds
- Peer count < 10
- Disk > 80%
- Memory > 80%

Low (daily digest):
- Minor log errors
- Resource trend changes
- Informational alerts
```bash
# Create alert routing script
sudo nano /opt/ripple/bin/send-alert.sh
```

```bash
#!/bin/bash
#===============================================================================
# Alert Routing Script
# Routes alerts based on severity
#===============================================================================

SEVERITY="$1"
MESSAGE="$2"
HOSTNAME=$(hostname)

# Destination configuration (placeholders; set for your environment)
EMAIL_CRITICAL="oncall@example.com"
EMAIL_WARNING="ops@example.com"
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
PAGERDUTY_KEY="your-routing-key"

send_email() {
local to="$1"
local subject="$2"
local body="$3"
echo "$body" | mail -s "$subject" "$to"
}

send_slack() {
local message="$1"
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"$message\"}" \
"$SLACK_WEBHOOK" > /dev/null 2>&1
}

send_pagerduty() {
local message="$1"
curl -s -X POST \
-H "Content-Type: application/json" \
-d "{
\"routing_key\": \"$PAGERDUTY_KEY\",
\"event_action\": \"trigger\",
\"payload\": {
\"summary\": \"$message\",
\"source\": \"$HOSTNAME\",
\"severity\": \"critical\"
}
}" \
"https://events.pagerduty.com/v2/enqueue" > /dev/null 2>&1
}

case $SEVERITY in
CRITICAL)
send_email "$EMAIL_CRITICAL" "[$HOSTNAME] CRITICAL: Validator Alert" "$MESSAGE"
send_slack ":rotating_light: CRITICAL [$HOSTNAME]: $MESSAGE"
send_pagerduty "$MESSAGE"
;;
HIGH)
send_email "$EMAIL_CRITICAL" "[$HOSTNAME] HIGH: Validator Alert" "$MESSAGE"
send_slack ":warning: HIGH [$HOSTNAME]: $MESSAGE"
;;
MEDIUM)
send_email "$EMAIL_WARNING" "[$HOSTNAME] MEDIUM: Validator Alert" "$MESSAGE"
send_slack ":information_source: MEDIUM [$HOSTNAME]: $MESSAGE"
;;
LOW)
# Just log, daily digest will pick up
logger -t validator-alert "[LOW] $MESSAGE"
;;
esac
```

```bash
chmod +x /opt/ripple/bin/send-alert.sh
```

```bash
# Add deduplication to prevent alert storms
sudo nano /opt/ripple/bin/dedupe-alert.sh
```

```bash
#!/bin/bash
#===============================================================================
# Alert Deduplication
# Prevents sending duplicate alerts within cooldown period
#===============================================================================

ALERT_KEY="$1"
COOLDOWN_SECONDS="${2:-300}" # Default 5 minute cooldown
ALERT_STATE_DIR="/var/run/validator-alerts"

mkdir -p "$ALERT_STATE_DIR"

STATE_FILE="$ALERT_STATE_DIR/$ALERT_KEY"

if [ -f "$STATE_FILE" ]; then
LAST_ALERT=$(cat "$STATE_FILE")
NOW=$(date +%s)
ELAPSED=$((NOW - LAST_ALERT))

if [ "$ELAPSED" -lt "$COOLDOWN_SECONDS" ]; then
# Within cooldown, don't alert
exit 1
fi
fi

# Record this alert

date +%s > "$STATE_FILE"
exit 0
```

```bash
chmod +x /opt/ripple/bin/dedupe-alert.sh

# Usage in health check:
if /opt/ripple/bin/dedupe-alert.sh "state_critical" 300; then
    /opt/ripple/bin/send-alert.sh CRITICAL "Validator state critical"
fi
```



```bash
# Create Prometheus exporter for rippled
sudo nano /opt/ripple/bin/rippled-exporter.py
```

```python
#!/usr/bin/env python3
"""
rippled Prometheus Exporter
Exposes rippled metrics in Prometheus format
"""

import subprocess
import json
import http.server
import socketserver

# Note: Prometheus itself defaults to port 9090; adjust if they share a host
PORT = 9090

def get_rippled_info():
    """Get server_info from rippled"""
    try:
        result = subprocess.run(
            ['/opt/ripple/bin/rippled', 'server_info'],
            capture_output=True,
            text=True,
            timeout=10
        )
        return json.loads(result.stdout)
    except Exception:
        return None

def get_peer_count():
    """Get peer count from rippled"""
    try:
        result = subprocess.run(
            ['/opt/ripple/bin/rippled', 'peers'],
            capture_output=True,
            text=True,
            timeout=10
        )
        data = json.loads(result.stdout)
        return len(data.get('result', {}).get('peers', []))
    except Exception:
        return 0

def generate_metrics():
    """Generate Prometheus metrics"""
    metrics = []

    # Metadata: HELP/TYPE lines must precede the samples they describe
    metrics.append('# HELP rippled_server_state Server state (1=proposing, 0.5=full, 0=other)')
    metrics.append('# TYPE rippled_server_state gauge')
    metrics.append('# HELP rippled_uptime_seconds Server uptime in seconds')
    metrics.append('# TYPE rippled_uptime_seconds counter')
    metrics.append('# HELP rippled_peers Number of connected peers')
    metrics.append('# TYPE rippled_peers gauge')

    info = get_rippled_info()
    if info and 'result' in info and 'info' in info['result']:
        data = info['result']['info']

        # Server state (1 = proposing, 0.5 = full, 0 = other)
        state = data.get('server_state', '')
        state_value = 1 if state == 'proposing' else (0.5 if state == 'full' else 0)
        metrics.append(f'rippled_server_state{{state="{state}"}} {state_value}')

        # Uptime
        uptime = data.get('uptime', 0)
        metrics.append(f'rippled_uptime_seconds {uptime}')

        # Peers
        peers = data.get('peers', 0)
        metrics.append(f'rippled_peers {peers}')

        # IO latency
        io_latency = data.get('io_latency_ms', 0)
        metrics.append(f'rippled_io_latency_ms {io_latency}')

        # Ledger age
        validated_ledger = data.get('validated_ledger', {})
        ledger_age = validated_ledger.get('age', 0)
        metrics.append(f'rippled_ledger_age_seconds {ledger_age}')

        # Ledger sequence
        ledger_seq = validated_ledger.get('seq', 0)
        metrics.append(f'rippled_ledger_sequence {ledger_seq}')

        # Load factor
        load_factor = data.get('load_factor', 1)
        metrics.append(f'rippled_load_factor {load_factor}')

    return '\n'.join(metrics)

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = generate_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # Suppress per-request logging

if __name__ == '__main__':
    with socketserver.TCPServer(("127.0.0.1", PORT), MetricsHandler) as httpd:
        print(f"Serving metrics on port {PORT}")
        httpd.serve_forever()
```

```bash
# Make executable and create service
sudo chmod +x /opt/ripple/bin/rippled-exporter.py

# Create systemd service
sudo nano /etc/systemd/system/rippled-exporter.service
```

```ini
[Unit]
Description=rippled Prometheus Exporter
After=rippled.service

[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/ripple/bin/rippled-exporter.py
Restart=always
User=rippled

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable rippled-exporter
sudo systemctl start rippled-exporter
```

```yaml
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'rippled'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 30s
```
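With the exporter scraped, Prometheus itself can drive alerting. A hypothetical rules file mirroring the thresholds used earlier in this lesson (the file path, alert names, and `for` durations are illustrative):

```yaml
# /etc/prometheus/rules/rippled.yml (illustrative)
groups:
  - name: rippled
    rules:
      - alert: ValidatorNotProposing
        expr: rippled_server_state < 1
        for: 5m
        labels:
          severity: critical
      - alert: LowPeerCount
        expr: rippled_peers < 10
        for: 10m
        labels:
          severity: warning
      - alert: LedgerAgeHigh
        expr: rippled_ledger_age_seconds > 30
        for: 2m
        labels:
          severity: critical
```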
A minimal Grafana dashboard definition using these metrics:

```json
{
  "dashboard": {
    "title": "XRPL Validator Dashboard",
    "panels": [
      {
        "title": "Server State",
        "type": "stat",
        "targets": [
          {
            "expr": "rippled_server_state"
          }
        ]
      },
      {
        "title": "Peer Count",
        "type": "gauge",
        "targets": [
          {
            "expr": "rippled_peers"
          }
        ]
      },
      {
        "title": "Ledger Age",
        "type": "graph",
        "targets": [
          {
            "expr": "rippled_ledger_age_seconds"
          }
        ]
      },
      {
        "title": "Uptime",
        "type": "stat",
        "targets": [
          {
            "expr": "rippled_uptime_seconds / 3600"
          }
        ]
      }
    ]
  }
}
```

External Monitoring Options:

Free uptime services:
- HTTP check to your server
- Email/SMS alerts
- 5-minute intervals free

Premium monitoring services:
- More features
- Paid service
- Detailed analytics

Freemium platforms:
- Free and paid tiers
- Multiple check types
- Good integration options

Self-hosted external checks:
- External server checking your validator
- More control
- More maintenance
```bash
# Create simple health endpoint (if running web server)
# Or use rippled's built-in status

# Check from external monitoring:
# - TCP check on port 51235 (peer port reachability)
# - HTTP check on local proxy if configured
```


Benefits of External Monitoring:

1. Network Perspective
2. Independence
3. Third-Party Validation

---
Dashboard Sections:

```
Section 1: Health Status (top, prominent)
┌──────────────────────────────────────────────┐
│ [STATE: Proposing ✓]  [PEERS: 15]  [AGE: 3s] │
│ [CPU: 12%]  [MEM: 45%]  [DISK: 34%]          │
└──────────────────────────────────────────────┘

Section 2: Trend Graphs (middle)
┌──────────────────────────────────────────────┐
│ [Peer Count over 24h]  [Ledger Age over 24h] │
│ [Resource Usage over 24h]                    │
└──────────────────────────────────────────────┘

Section 3: Recent Events (bottom)
┌──────────────────────────────────────────────┐
│ [Recent Alerts]  [State Changes]  [Logs]     │
└──────────────────────────────────────────────┘
```
Alert Summary Panel:

```
Active Alerts: 0 ✓
Last 24h Alerts: 2
  - [12:34] WARNING: Peer count dropped to 8
  - [14:56] OK: Peer count recovered to 12

Last State Change: 3 days ago
Uptime: 99.97% (30 days)
```
Essential Metrics for Mobile:

1. Overall Status (Green/Yellow/Red)
2. Server State
3. Key Numbers:

Quick glance should answer:
"Is my validator healthy right now?"

Common Causes of Alert Fatigue:

Too many alerts:
- Alerting on transient conditions
- Low thresholds trigger constantly
- Every minor fluctuation alerts

Duplicate alerts:
- Same alert repeatedly
- No cooldown period
- Redundant checks

Non-actionable alerts:
- Alerts you can't fix
- Informational alerts as pages
- Unclear what to do

Prevention:
- Tune thresholds based on baseline
- Implement deduplication
- Clear escalation procedures
- Review and adjust regularly
```bash
# Analyze baseline to set appropriate thresholds

# Review metrics history: median peer count
cat /var/log/validator-metrics/metrics_*.json |
jq -s 'map(.peers) | sort | .[length/2 | floor]'

# Set thresholds at:
#   Warning:  20% below median
#   Critical: 50% below median
#
# Example:
#   Median peers: 15
#   Warning: 12 (20% below)
#   Critical: 8 (about 50% below)
```


Weekly Alert Review:

1. Count alerts by type
2. Identify frequent alerters
3. Analyze false positives
4. Adjust thresholds as needed
5. Update documentation

Questions to ask:
- Was every alert actionable?
- Were any alerts missed?
- Were thresholds appropriate?
- Is there a pattern to alerts?

---

Monitoring enables faster incident response - Issues detected in minutes rather than hours

Alerting prevents extended outages - Automated notification ensures awareness

Dashboards provide operational visibility - At-a-glance status aids decision-making

Trend data enables proactive maintenance - Patterns visible before they become incidents

⚠️ Optimal thresholds vary - Baseline metrics differ by infrastructure; tuning required

⚠️ Best monitoring platform - Prometheus/Grafana work well but alternatives exist

⚠️ Alert routing complexity - Balance between noise and coverage requires iteration

📌 Monitoring without alerting - Dashboards alone aren't sufficient; need proactive notification

📌 Over-alerting causes fatigue - Too many alerts leads to ignoring them

📌 Under-monitoring critical metrics - Missing key indicators means missed incidents

📌 Single point of failure in monitoring - Monitoring system failure = blind operation

Start simple: a health check script on cron that emails you when something's wrong. This catches most issues. Expand to Prometheus/Grafana if you want dashboards and historical trending. The key is having something that alerts you—sophistication can come later.

A simple system that you actually monitor beats an elaborate system that you ignore.


Assignment: Implement comprehensive monitoring for your validator.

Requirements:

Health checks:
  • Implement comprehensive health check script
  • Configure scheduled execution (cron)
  • Test alert triggering
  • Document thresholds and rationale

Alerting:
  • Configure alert routing (email at minimum)
  • Implement deduplication
  • Test alert delivery
  • Document escalation procedures

Metrics:
  • Implement metrics collection script
  • Configure scheduled collection
  • Verify data storage
  • Create basic trend analysis capability

Visibility:
  • Create operational visibility (Grafana dashboard OR daily email report)
  • Include key metrics
  • Document access and usage
  • Test functionality

Deliverables:
  • PDF or Markdown document
  • Scripts and configuration files
  • Screenshots of dashboard or report samples
  • Documentation

Grading:
  • Functional health check system (30%)
  • Working alerting with deduplication (25%)
  • Metrics collection and storage (25%)
  • Operational visibility (20%)
Time investment: 6-8 hours
Value: Professional monitoring system protecting your validator reputation


Knowledge Check

1. Monitoring Intervals (Tests Design Understanding):

What is an appropriate interval for checking validator server state?

A) Once per hour
B) Once per minute
C) Once per day
D) Once per week

Correct Answer: B
Explanation: Server state should be checked frequently—once per minute or more often. A validator that drops out of "proposing" state needs immediate attention. Hourly or daily checks would allow significant reputation damage before detection.


2. Alert Deduplication (Tests Implementation Knowledge):

Why is alert deduplication important?

A) It saves money on SMS costs
B) It prevents alert fatigue from receiving the same alert repeatedly
C) It improves validator performance
D) It's required by monitoring platforms

Correct Answer: B
Explanation: Without deduplication, an ongoing issue generates repeated alerts—potentially hundreds. This causes alert fatigue, where operators start ignoring alerts. Deduplication sends one alert and suppresses duplicates for a cooldown period, maintaining alert effectiveness.


3. External Monitoring (Tests Strategy Understanding):

What unique value does external monitoring provide?

A) It's faster than internal monitoring
B) It shows what the network sees—detecting connectivity issues invisible from inside
C) It's more accurate than internal monitoring
D) It's cheaper than internal monitoring

Correct Answer: B
Explanation: External monitoring checks your validator from the network's perspective. If your peer port is unreachable due to firewall or network issues, internal monitoring won't detect it—the server thinks it's fine. External monitoring catches connectivity problems that internal checks miss.


4. Threshold Setting (Tests Operational Knowledge):

How should you set alert thresholds for peer count?

A) Use industry standard values
B) Base thresholds on your specific baseline metrics
C) Set as low as possible to catch all issues
D) Copy thresholds from documentation

Correct Answer: B
Explanation: Thresholds should be based on your specific baseline. A validator that typically has 20 peers should alert at different thresholds than one with 12 peers. Generic thresholds may be too sensitive (constant alerts) or too lenient (missing issues) for your specific situation.


5. Dashboard Priority (Tests Design Understanding):

What should be most prominent on a validator operational dashboard?

A) Historical graphs
B) Current health status (state, peers, ledger age)
C) Log entries
D) System configuration

Correct Answer: B
Explanation: Current health status should be most prominent—an operator glancing at the dashboard should immediately see if the validator is healthy. Historical graphs, logs, and configuration are valuable but secondary to answering "Is my validator healthy right now?"


Additional Resources:

  • Google SRE Book - Monitoring chapters
  • Alert fatigue research
  • Dashboard design principles

For Next Lesson:
With monitoring in place, Lesson 14 will cover routine maintenance procedures—the regular tasks that keep your validator running smoothly over time.


End of Lesson 13


Key Takeaways

1. Health checks every minute catch issues quickly: automated checks at 1-minute intervals ensure rapid detection of problems.

2. Deduplication prevents alert storms: implement cooldown periods to avoid receiving hundreds of alerts for a single issue.

3. External monitoring provides network perspective: internal monitoring plus external checks give complete visibility.

4. Tune thresholds to your baseline: generic thresholds may not match your infrastructure; adjust based on observed metrics.

5. Review alerts weekly: regular review prevents alert fatigue and ensures thresholds remain appropriate.