Advanced · 55 min

Benchmarking & Performance Measurement - Trust But Verify

Learning Objectives

Design meaningful benchmarks that test realistic XRPL workloads rather than synthetic best-case scenarios

Execute reproducible tests on testnet with proper methodology and statistical rigor

Interpret latency percentiles correctly—understanding why p99 matters more than average

Identify misleading claims in published performance data from any blockchain project

Establish monitoring baselines for production XRPL applications

In 2023, a Layer 1 blockchain announced "1,000,000 TPS" capability. The fine print: synthetic transactions, single data center, no actual state changes, pre-signed transactions, no network latency.

In production? 2,000 TPS with 40% success rate during high load.

The gap between claimed and actual performance is an industry-wide problem.

XRPL's advantage is that its claims are relatively honest—1,500 TPS sustained means exactly that, under realistic conditions. But even honest numbers require context. What transaction types? What network conditions? What failure rate?

This lesson teaches you to generate your own numbers so you never have to trust anyone else's.


Meaningful benchmarks must be:

  1. Representative: reflects real workloads and network conditions

  2. Reproducible: documented methodology that others can repeat

  3. Comparable: consistent measurement points across runs

  4. Actionable: results inform real capacity and design decisions

Common Benchmark Anti-Patterns:

Anti-Pattern              | Problem                    | Reality
--------------------------|----------------------------|------------------
Best-case-only testing    | Hides failure modes        | Prod isn't best case
Single transaction type   | Doesn't reflect mix        | Real load is mixed
Zero network latency      | Unrealistic conditions     | Networks have latency
Pre-generated signatures  | Skips CPU work             | Signing takes time
Ignoring failures         | Overstates success rate    | Failures count
Peak vs. sustained        | Misleading capacity claims | Sustained matters

Realistic mainnet transaction mix (approximate):
  • XRP Payments: 40%
  • Token Payments: 15%
  • DEX Offers: 25%
  • NFT Operations: 10%
  • AMM Operations: 5%
  • Other (escrow, checks): 5%

A synthetic "best case" workload, by contrast, looks like:
  • 100% simple payments
  • Pre-funded accounts only
  • No DEX interaction
  • No failures

Realistic network conditions (a latency-injection sketch follows these lists):
  • Testnet or isolated mainnet segment
  • Geographic distribution of load generators
  • Variable network latency (50-200ms)
  • Some packet loss (0.1-1%)

Unrealistic network conditions:
  • Single machine submission
  • Local network only
  • Zero latency
  • Perfect connectivity
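
If you cannot distribute load generators geographically, you can at least approximate realistic conditions by injecting artificial latency and loss at the submission layer. A minimal sketch, assuming submitTransaction is a placeholder for your actual submit path:

```
// Wrap the real submit path with injected latency and packet loss so a
// local test box behaves more like a geographically distributed client.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitWithRealisticConditions(submitTransaction) {
  // 50-200 ms of simulated one-way network latency
  await sleep(50 + Math.random() * 150);

  // ~0.5% simulated packet loss: surface it as an error, not silence
  if (Math.random() < 0.005) {
    throw new Error("simulated packet loss / timeout");
  }

  return submitTransaction();
}
```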

Before benchmarking, define what you're measuring:

Throughput Metrics:
├── Submitted TPS: Transactions sent to network
├── Accepted TPS: Transactions passing validation
├── Confirmed TPS: Transactions in validated ledgers
└── Effective TPS: Confirmed with desired outcome

Latency Metrics:
├── Submission latency: Time to acknowledgment
├── Confirmation latency: Time to validated status
├── End-to-end latency: User action to confirmed result
└── Percentiles: p50, p95, p99, p99.9

Reliability Metrics:
├── Success rate: % transactions confirmed
├── Error rate: % rejected or failed
├── Timeout rate: % no response in threshold
└── Throughput stability: Variance over time
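
To make the throughput funnel concrete, the following sketch reduces per-transaction records into the four TPS figures. The boolean field names are assumptions for illustration; map them onto whatever your collector actually stores (see the measurement object later in this lesson):

```
// Reduce per-transaction records into the four throughput metrics.
// Field names (submitted/accepted/confirmed/desiredOutcome) are
// illustrative assumptions, not a fixed schema.
function throughputMetrics(records, durationSeconds) {
  const count = (predicate) => records.filter(predicate).length;

  return {
    submittedTps: count((r) => r.submitted) / durationSeconds,
    acceptedTps: count((r) => r.accepted) / durationSeconds,   // passed validation
    confirmedTps: count((r) => r.confirmed) / durationSeconds, // in a validated ledger
    effectiveTps: count((r) => r.confirmed && r.desiredOutcome) / durationSeconds,
  };
}
```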

Environment Option 1: Public Testnet

Pros:
  • Real network conditions
  • Multiple validators
  • No cost (test XRP is free)
  • Closest to production behavior

Cons:
  • Shared environment (other traffic)
  • Can't control validator behavior
  • Rate limits may apply
  • Less reproducible

Best for: Realistic production simulation

Environment Option 2: Private/Standalone Network

Pros:
  • Full control
  • Reproducible conditions
  • No interference
  • Can test edge cases

Cons:
  • Doesn't reflect real network
  • Single-node consensus differs
  • Setup complexity
  • May miss production issues

Best for: Controlled experiments, stress testing

Environment Option 3: Devnet

Pros:
  • More features than testnet
  • Faster reset capability
  • Good for development testing

Cons:
  • Less stable
  • May have unreleased features
  • Smaller validator set

Best for: Feature testing, development
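
Whichever environment you choose, switching between them in xrpl.js is mostly a matter of the WebSocket endpoint. A minimal sketch (the testnet/devnet URLs are the commonly documented public ones; the standalone URL assumes a local rippled with a default WebSocket configuration):

```
const xrpl = require("xrpl");

// Commonly documented public endpoints; the local URL assumes a default
// standalone rippled (your rippled.cfg port may differ).
const ENDPOINTS = {
  testnet: "wss://s.altnet.rippletest.net:51233",
  devnet: "wss://s.devnet.rippletest.net:51233",
  standalone: "ws://localhost:6006",
};

async function connectTo(environment) {
  const client = new xrpl.Client(ENDPOINTS[environment]);
  await client.connect();
  return client;
}
```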

Simple Architecture (Low-Volume Testing):

┌─────────────────┐
│  Load Generator │
│  (Single Node)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  XRPL Server    │
│  (Public/Local) │
└─────────────────┘

Limitations:
  • Single point of failure
  • Network bottleneck
  • Can't saturate network
  • Max ~500 TPS realistic

Distributed Architecture (High-Volume Testing):

┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Gen 1   │ │ Gen 2   │ │ Gen 3   │ │ Gen N   │
│ (US-E)  │ │ (EU-W)  │ │ (APAC)  │ │ (...)   │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
     │          │          │          │
     ▼          ▼          ▼          ▼
┌──────────────────────────────────────────────┐
│            XRPL Network (Multiple Servers)   │
└──────────────────────────────────────────────┘

Benefits (a load-partitioning sketch follows below):
  • Geographic distribution
  • No single bottleneck
  • Can saturate network
  • More realistic
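
When you split load across N generators, give each one a slice of the target rate and, critically, a disjoint set of sender accounts: two generators signing from the same account will race on its sequence number. A hedged sketch of that partitioning:

```
// Partition a global TPS target and a pre-funded account pool across N
// generators. Disjoint account shards avoid sequence-number collisions:
// two generators signing from the same account would race on Sequence.
function partitionLoad(accounts, targetTps, generatorCount) {
  const shardSize = Math.floor(accounts.length / generatorCount);

  return Array.from({ length: generatorCount }, (_, i) => ({
    generatorId: i,
    tps: targetTps / generatorCount,
    accounts: accounts.slice(i * shardSize, (i + 1) * shardSize),
  }));
}
```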

Prerequisites:

// Account setup before testing
const xrpl = require("xrpl");

const accounts = [];
for (let i = 0; i < 1000; i++) {
  const wallet = xrpl.Wallet.generate();
  // fundAccount is your own helper: the testnet faucet (e.g. client.fundWallet
  // in xrpl.js) or a genesis-account payment on a private network. The amount
  // is in drops: 10,000,000,000 drops = 10,000 XRP. Funding 1,000 accounts
  // serially is slow; batch or parallelize where the faucet allows.
  await fundAccount(wallet.address, "10000000000"); // 10,000 XRP
  accounts.push(wallet);
}

Transaction Generation Patterns:

// Sleep helper used by all three patterns
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pattern 1: Constant Rate
// Note: awaiting each submission serially caps throughput at roughly
// 1000 / latency_ms TPS; for higher rates, fire submissions without
// awaiting confirmation inside this loop.
async function constantRate(tps, duration) {
  const interval = 1000 / tps; // ms between transactions
  const endTime = Date.now() + duration;

  while (Date.now() < endTime) {
    const start = Date.now();
    await submitTransaction(); // your submit path (placeholder)
    const elapsed = Date.now() - start;
    if (elapsed < interval) {
      await sleep(interval - elapsed);
    }
  }
}

// Pattern 2: Ramp Up
async function rampUp(startTps, endTps, rampDuration) {
  const steps = 10;
  const stepDuration = rampDuration / steps;
  const tpsIncrement = (endTps - startTps) / steps;

  for (let i = 0; i < steps; i++) {
    const currentTps = startTps + tpsIncrement * i;
    await constantRate(currentTps, stepDuration);
  }
}

// Pattern 3: Burst
async function burst(tps, burstDuration, cooldown, iterations) {
  for (let i = 0; i < iterations; i++) {
    await constantRate(tps, burstDuration);
    await sleep(cooldown);
  }
}
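
Composed into a full run, the patterns above might be sequenced like this (the target numbers are illustrative):

```
// Illustrative benchmark run: warm up, ramp to target, sustain, then burst.
async function runBenchmark() {
  await constantRate(10, 60_000);      // 1 min warm-up at 10 TPS
  await rampUp(10, 100, 120_000);      // ramp 10 -> 100 TPS over 2 min
  await constantRate(100, 600_000);    // sustain 100 TPS for 10 min
  await burst(300, 10_000, 30_000, 3); // 3 bursts of 300 TPS, 10 s each
}
```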

What to Record:

const measurement = {
  // Transaction identification
  txHash: "...",
  txType: "Payment",

  // Timing
  timestamps: {
    created: 1699900000000,      // Transaction built
    signed: 1699900000005,       // Signature added
    submitted: 1699900000010,    // Sent to server
    acknowledged: 1699900000150, // Server response
    validated: 1699900003200,    // In validated ledger
  },

  // Calculated latencies
  latencies: {
    signing: 5,         // ms
    submission: 5,      // ms
    network: 140,       // ms (acknowledged - submitted)
    confirmation: 3050, // ms (validated - acknowledged)
    total: 3200,        // ms (validated - created)
  },

  // Outcome
  result: {
    success: true,
    ledgerIndex: 12345678,
    fee: "12",
    engineResult: "tesSUCCESS",
  },

  // Context
  context: {
    testRun: "benchmark-2024-01-15-001",
    targetTps: 100,
    generator: "us-east-1",
  }
};
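
In practice a collector wraps the submit path and stamps each stage. A minimal sketch using xrpl.js (submitAndWait resolves once the transaction is in a validated ledger, so this version collapses the acknowledgment stage; client is assumed to be a connected xrpl.Client and wallet a funded wallet):

```
// Capture the timestamps above around an xrpl.js submit-and-wait cycle.
async function measuredSubmit(client, wallet, tx, context) {
  const created = Date.now();
  const prepared = await client.autofill(tx);

  const signed = wallet.sign(prepared);
  const signedAt = Date.now();

  const submitted = Date.now();
  // submitAndWait resolves once the transaction is in a validated ledger
  const response = await client.submitAndWait(signed.tx_blob);
  const validated = Date.now();

  return {
    txHash: signed.hash,
    txType: tx.TransactionType,
    latencies: {
      signing: signedAt - created,
      confirmation: validated - submitted,
      total: validated - created,
    },
    result: {
      success: response.result.meta.TransactionResult === "tesSUCCESS",
      ledgerIndex: response.result.ledger_index,
      engineResult: response.result.meta.TransactionResult,
    },
    context,
  };
}
```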

Why Percentiles Matter More Than Average:

Consider 100 transactions:
  • 90 transactions: 3,000 ms
  • 9 transactions: 5,000 ms
  • 1 transaction: 60,000 ms (network issue)

The resulting statistics:
  • Mean (average): (90×3,000 + 9×5,000 + 60,000) / 100 = 3,750 ms
  • Median (p50): 3,000 ms
  • p95: 5,000 ms
  • p99: 60,000 ms

Which metric to use:
  • For user experience: p99 (1 in 100 users waits 60 seconds)
  • For capacity planning: p95 (5% of traffic is slow)
  • For typical experience: p50 (median user)
  • NOT average (skewed by outliers)

Percentile Calculation:

function calculatePercentiles(latencies) {
  // Nearest-rank approximation; adequate for benchmark-sized samples
  const sorted = [...latencies].sort((a, b) => a - b);
  const n = sorted.length;

  return {
    p50: sorted[Math.floor(n * 0.50)],
    p75: sorted[Math.floor(n * 0.75)],
    p90: sorted[Math.floor(n * 0.90)],
    p95: sorted[Math.floor(n * 0.95)],
    p99: sorted[Math.floor(n * 0.99)],
    p999: sorted[Math.floor(n * 0.999)],
    min: sorted[0],
    max: sorted[n - 1],
    mean: latencies.reduce((a, b) => a + b, 0) / n,
  };
}

Minimum Sample Sizes:

A useful rule of thumb: you need roughly 100 samples beyond the percentile point, i.e. 100 / (1 - q) samples for percentile q:

  • p50: ~30 samples minimum (Central Limit Theorem territory)

  • p95: 100 / 0.05 = ~2,000 samples

  • p99: 100 / 0.01 = ~10,000 samples

  • p99.9: 100 / 0.001 = ~100,000 samples

For a trustworthy XRPL benchmark (the rule is codified in the sketch below):

  • Minimum: 10,000 transactions

  • Recommended: 100,000+ transactions

  • Duration: At least 10 minutes sustained
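
The rule of thumb translates directly into code; a one-line sketch:

```
// Samples needed for a stable estimate of percentile q (0 < q < 1),
// using the "100 samples beyond the percentile" rule of thumb.
const requiredSamples = (q) => Math.round(100 / (1 - q));

requiredSamples(0.95);  // 2000
requiredSamples(0.99);  // 10000
requiredSamples(0.999); // 100000
```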

Confidence Intervals:

function confidenceInterval(data, confidence = 0.95) {
  const n = data.length;
  const mean = data.reduce((a, b) => a + b, 0) / n;
  const stdDev = Math.sqrt(
    data.reduce((sum, x) => sum + Math.pow(x - mean, 2), 0) / (n - 1)
  );

  // z = 1.96 corresponds to the default 95% confidence level;
  // for other levels, substitute the matching z-score.
  const zScore = 1.96;
  const marginOfError = zScore * (stdDev / Math.sqrt(n));

  return {
    mean,
    lower: mean - marginOfError,
    upper: mean + marginOfError,
    marginOfError,
  };
}

Anomaly Detection:

function identifyAnomalies(latencies) {
  // Tukey's fences: IQR = Q3 - Q1, outliers beyond 1.5 × IQR
  const sorted = [...latencies].sort((a, b) => a - b);
  const q1 = sorted[Math.floor(sorted.length * 0.25)];
  const q3 = sorted[Math.floor(sorted.length * 0.75)];
  const iqr = q3 - q1;

  const lowerBound = q1 - 1.5 * iqr;
  const upperBound = q3 + 1.5 * iqr;

  const anomalies = latencies.filter(
    (l) => l < lowerBound || l > upperBound
  );

  return {
    anomalies,
    anomalyRate: anomalies.length / latencies.length,
    bounds: { lower: lowerBound, upper: upperBound },
  };
}

Common Anomaly Sources:

Anomaly Pattern       | Likely Cause               | Investigation
----------------------|----------------------------|------------------
Periodic spikes       | GC pause, cron job         | Check server logs
Gradual degradation   | Memory leak, state growth  | Monitor resources
Bimodal distribution  | Cache hit/miss             | Analyze cache rates
Sudden step change    | Config change, deployment  | Check change logs
Random spikes         | Network issues             | Check connectivity

Standard Report Format:

# XRPL Performance Benchmark Report

## Summary
  • Test date: 2024-01-15
  • Duration: 60 minutes
  • Total transactions: 250,000
  • Target TPS: 100
  • Achieved TPS: 98.3
  • Success rate: 99.7%

## Configuration
  • Network: XRPL Testnet
  • Transaction types: Mixed (40% payment, 25% DEX, ...)
  • Load generators: 5 (geographically distributed)
  • Accounts: 1,000 pre-funded

## Analysis
[Detailed findings, anomalies, bottlenecks identified]

## Recommendations
[Optimization suggestions based on results]

Valid Comparisons:

Comparing XRPL benchmarks:
✓ Same transaction types
✓ Same network (testnet vs testnet)
✓ Same duration
✓ Same success criteria
✓ Same measurement methodology

Invalid comparisons:
✗ Payment-only vs mixed workload
✗ Private testnet vs public testnet
✗ 1-minute test vs 1-hour test
✗ Including failures vs excluding failures
✗ Different latency measurement points

Cross-Protocol Comparison Framework:

When comparing XRPL to other blockchains:

  1. Define equivalent operations

  2. Measure at equivalent points

  3. Use same success criteria

  4. Account for different finality models

Misinterpretation 1: Peak vs. Sustained

Claim: "Achieved 3,000 TPS!"
Reality: Peak for 10 seconds; sustained was 1,200 TPS

Honest reporting:
  • Peak TPS: 3,000 (burst capacity)
  • Sustained TPS: 1,200 (production capacity)
  • Always report both

Misinterpretation 2: Ignoring Failures

Claim: "99.9% success rate!"
Reality: Only counting transactions that got responses

The full accounting:
  • Submitted: 10,000
  • Acknowledged: 9,950 (50 timeouts)
  • Confirmed: 9,900 (50 failed validation)
  • Actual success: 9,900 / 10,000 = 99.0%

Misinterpretation 3: Wrong Latency Point

Claim: "500ms latency!"
Reality: Time to acknowledgment, not finality

XRPL latency at each measurement point:
  • Submission ack: ~100-200ms
  • Included in ledger: ~3,000ms (varies)
  • Validated (final): ~3,500-4,500ms

"Latency" without qualification is meaningless.
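
To see the difference yourself, measure acknowledgment and finality separately. A hedged sketch using the raw submit and tx commands via xrpl.js (polling interval and timeout are illustrative):

```
// Measure ack latency (submit response) and finality latency (validated
// ledger) separately, since "latency" alone conflates the two.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function latencyBreakdown(client, signed) {
  const t0 = Date.now();
  const submit = await client.request({ command: "submit", tx_blob: signed.tx_blob });
  const ackMs = Date.now() - t0; // acknowledgment only, ~100-200 ms

  // Poll (up to ~20 s) until the transaction is in a validated ledger
  for (let i = 0; i < 40; i++) {
    await sleep(500);
    try {
      const tx = await client.request({ command: "tx", transaction: signed.hash });
      if (tx.result.validated) {
        return {
          engineResult: submit.result.engine_result,
          ackMs,
          finalMs: Date.now() - t0, // finality, typically several seconds
        };
      }
    } catch (_) { /* txnNotFound until it lands in a ledger */ }
  }
  throw new Error("not validated within timeout");
}
```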


Baseline Metrics to Track:

Performance Baselines:
├── Latency (p50, p95, p99 per transaction type)
├── Throughput (TPS per hour/day)
├── Error rates (by error type)
├── Resource utilization (CPU, memory, I/O, network)
└── Queue depths (pending transactions)

Example baseline definition:
{
  "payment_latency_p99": {
    "baseline": 4500,
    "warning": 5500,  // +22%
    "critical": 7000, // +55%
    "unit": "ms"
  },
  "success_rate": {
    "baseline": 99.5,
    "warning": 99.0,
    "critical": 98.0,
    "unit": "percent"
  }
}
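
A sketch of evaluating a live metric against such a baseline definition (field names follow the JSON above; rate-style metrics like success_rate invert the comparison direction):

```
// Evaluate a current metric value against baseline thresholds.
// For latency-style metrics, higher is worse; for rate-style metrics
// (e.g. success_rate), lower is worse -- hence the direction flag.
function evaluateMetric(value, thresholds, higherIsWorse = true) {
  const breaches = (limit) =>
    higherIsWorse ? value >= limit : value <= limit;

  if (breaches(thresholds.critical)) return "critical";
  if (breaches(thresholds.warning)) return "warning";
  return "ok";
}

evaluateMetric(6000, { warning: 5500, critical: 7000 });        // "warning"
evaluateMetric(98.5, { warning: 99.0, critical: 98.0 }, false); // "warning"
```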

Alert Configuration:

alerts:
  - name: "High Latency"
    metric: "confirmation_latency_p99"
    condition: "> 6000"
    duration: "5m"
    severity: "warning"

  # The remaining alerts follow the same shape; the thresholds below are
  # aligned with the baseline definitions above.
  - name: "Critical Latency"
    metric: "confirmation_latency_p99"
    condition: "> 7000"
    duration: "5m"
    severity: "critical"

  - name: "Elevated Error Rate"
    metric: "error_rate"
    condition: "> 1.0"   # percent; success rate below the 99.0% warning line
    duration: "5m"
    severity: "warning"

  - name: "Throughput Drop"
    metric: "confirmed_tps"
    condition: "< 0.8 * baseline"
    duration: "10m"
    severity: "warning"

Automated Performance Testing:

Cadence:
  • Smoke test: Every deployment (basic functionality)
  • Performance test: Daily (latency, throughput)
  • Stress test: Weekly (capacity limits)
  • Soak test: Monthly (long-duration stability)

Smoke test (a runnable sketch follows these lists):
  • 100 transactions
  • Basic latency check
  • Error rate < 1%

Performance test:
  • 10,000+ transactions
  • Full percentile analysis
  • Compare to baseline

Stress test:
  • Ramp to 80% capacity
  • Hold for 30 minutes
  • Measure degradation

Soak test:
  • Sustained 50% capacity
  • Check for memory leaks
  • Verify stability
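
As a concrete example, a smoke test can be a short script that gates deployment. A minimal sketch, assuming the measuredSubmit and calculatePercentiles helpers from earlier in this lesson and a hypothetical buildTestPayment helper:

```
// Deployment-gating smoke test: 100 transactions, error rate < 1%,
// plus a coarse latency ceiling. buildTestPayment is a hypothetical
// helper returning a small self-payment from a pre-funded test wallet.
async function smokeTest(client, wallet) {
  const results = [];
  for (let i = 0; i < 100; i++) {
    try {
      results.push(await measuredSubmit(client, wallet, buildTestPayment(wallet), {}));
    } catch (err) {
      results.push({ result: { success: false }, latencies: null });
    }
  }

  const failures = results.filter((r) => !r.result.success).length;
  const latencies = results
    .filter((r) => r.latencies)
    .map((r) => r.latencies.total);
  const p95 = calculatePercentiles(latencies).p95;

  const pass = failures / results.length < 0.01 && p95 < 10_000;
  return { pass, failures, p95 };
}
```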


What this lesson verified:

✅ Percentiles reveal what averages hide—p99 can be 10× worse than p50

✅ XRPL's published numbers are honest—1,500 TPS sustained is achievable under documented conditions

✅ Testnet reasonably approximates mainnet—for performance testing, with some caveats

Open caveats:

⚠️ Long-term performance stability—multi-hour tests may reveal issues not visible in short tests

⚠️ Cross-protocol comparisons—even with methodology, different finality models complicate comparison

Pitfalls to avoid:

📌 Drawing conclusions from small samples—p99 needs 10,000+ samples to be meaningful

📌 Ignoring failure modes in benchmarks—real systems fail; benchmarks should too

📌 Using peak numbers for capacity planning—sustained capacity is what matters

Performance benchmarking is a discipline, not just running a script. Meaningful results require careful methodology, adequate sample sizes, and honest reporting of both successes and failures. XRPL's performance is genuinely good—but don't take anyone's word for it, including Ripple's. The tools and techniques in this lesson let you verify claims independently.


Assignment: Create and execute a comprehensive XRPL benchmark suite.

Requirements:

Design:
  • Define 3 test scenarios (smoke, performance, stress)
  • Specify transaction mix, duration, success criteria
  • Document methodology completely

Implementation:
  • Build load generation scripts (JavaScript/Python)
  • Implement measurement collection
  • Create results analysis pipeline

Execution:
  • Run all 3 scenarios on XRPL testnet
  • Collect at least 10,000 transactions per scenario
  • Record all measurements

Analysis & Report:
  • Full statistical analysis with percentiles
  • Comparison to expected/baseline values
  • Anomaly identification and analysis
  • Recommendations based on findings

Grading:
  • Sound methodology (25%)
  • Working implementation (25%)
  • Adequate sample sizes (25%)
  • Insightful analysis (25%)

Time investment: 4-5 hours


1. Why is p99 latency more important than average latency for user experience?

A) p99 is always higher than average
B) p99 represents the experience of 1% of users, which at scale is thousands of people
C) Averages are technically difficult to calculate
D) p99 is industry standard

Correct Answer: B


2. What is the minimum sample size needed for a reliable p99 latency measurement?

A) 100 samples
B) 1,000 samples
C) 10,000+ samples
D) 100,000+ samples

Correct Answer: C


3. A blockchain claims "sub-second finality." What question should you ask first?

A) What programming language is used?
B) Is this deterministic finality or optimistic confirmation?
C) How many nodes are in the network?
D) What is the token price?

Correct Answer: B


4. Which benchmark configuration would produce MISLEADING results for XRPL?

A) Mixed transaction types on testnet
B) Geographically distributed load generators
C) 100% simple payments with pre-funded accounts and no network latency
D) 60-minute sustained test with 100,000 transactions

Correct Answer: C


5. Your benchmark shows p50 = 3,800ms and p99 = 12,000ms. What does this indicate?

A) The system is performing well
B) There are significant outliers causing tail latency issues
C) The test methodology is flawed
D) Not enough data was collected

Correct Answer: B


Further Reading:
  • Brendan Gregg, "Systems Performance"
  • Google SRE book, "Testing Reliability"
  • ACM SIGMETRICS papers on benchmark methodology
  • "Practical Statistics for Data Scientists" (percentiles and confidence intervals)
  • XRPL testnet documentation
  • rippled performance tuning guides
For Next Lesson:
Phase 2 begins with Lesson 6: Cryptographic Optimization—the first optimization technique for improving XRPL throughput.


End of Lesson 5


Key Takeaways

1. Design for reality: Benchmarks must use realistic transaction mixes, network conditions, and failure scenarios. Synthetic best-case tests are marketing, not engineering.

2. Percentiles over averages: p99 latency affects 1% of users—at scale, that's thousands of people. Always report p50, p95, p99.

3. Sample size matters: p99 requires ~10,000 samples for reliability. Short tests with few transactions produce meaningless statistics.

4. Methodology is everything: Two honest engineers can get 10× different results from the same system with different methodologies. Document everything.

5. Continuous monitoring extends benchmarking: One-time tests establish baselines; ongoing monitoring detects regressions before users do.