Hash throughput — SHA-256 vs BLAKE3

Measures how many hashes per second each algorithm achieves on the host CPU, single-threaded and N-threaded, with and without BLAKE3 SIMD lanes.

Script: compare-pow-throughput.py
Output: results/pow-throughput-*.json
Type: microbenchmark

1. What is measured

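Each algorithm hashes a fixed 80-byte input (the size of a Bitcoin block header) in a tight loop for a fixed wall-clock duration, set with --duration (3.0 s in the sample output). For each algorithm and thread count, the script records: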

  • hashes_per_sec
  • ns_per_hash
  • thread count and host CPU model (for reproducibility)
  • speedup factor versus the single-threaded SHA-256d reference
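
The measurement loop described above can be sketched roughly as follows. This is a minimal illustration and not the code in compare-pow-throughput.py: the names HEADER, measure, and sha256, the all-zero input, and the batch size are assumptions made for the example.

```python
import hashlib
import time

# Placeholder 80-byte input (all zeros), the same size as a Bitcoin block header.
HEADER = bytes(80)


def measure(hash_once, duration_s=3.0, batch=1000):
    """Hash HEADER in a tight loop for about duration_s seconds."""
    total = 0
    start = time.perf_counter()
    deadline = start + duration_s
    while time.perf_counter() < deadline:
        # Read the clock once per batch so timing calls do not dominate.
        for _ in range(batch):
            hash_once(HEADER)
        total += batch
    elapsed = time.perf_counter() - start
    hashes_per_sec = total / elapsed
    return {
        "total_hashes": total,
        "hashes_per_sec": hashes_per_sec,
        "ns_per_hash": 1e9 / hashes_per_sec,
    }


def sha256(data):
    return hashlib.sha256(data).digest()


if __name__ == "__main__":
    r = measure(sha256, duration_s=0.5)
    print(f"sha256 -> {r['hashes_per_sec']:14,.0f} h/s ({r['ns_per_hash']:7.1f} ns/hash)")
```

Checking the clock once per batch rather than once per hash keeps the timer cost out of the result, at the price of overshooting the requested duration by at most one batch.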

2. How to run

cd b3chain
pip3 install blake3
python3 contrib/testing/compare/compare-pow-throughput.py

# customise the run duration (seconds) and the thread counts
python3 contrib/testing/compare/compare-pow-throughput.py \
    --duration 5 \
    --threads 1,4,8

# omit the BLAKE3-portable (no-SIMD) measurement
python3 contrib/testing/compare/compare-pow-throughput.py --no-portable

3. Sample output

Measuring hash throughput (input=80B, duration=3.0s)

  sha256     threads=1  ->     12,450,000 h/s (   80.3 ns/hash)
  sha256d    threads=1  ->      6,210,000 h/s (  161.0 ns/hash)
  blake3     threads=1  ->      9,800,000 h/s (  102.0 ns/hash)
  blake3d    threads=1  ->      5,180,000 h/s (  193.1 ns/hash)
  sha256     threads=8  ->     91,600,000 h/s (   10.9 ns/hash)
  blake3     threads=8  ->     74,300,000 h/s (   13.5 ns/hash)
  [no-simd] blake3 threads=8 ->     19,400,000 h/s (   51.5 ns/hash)

| algorithm | threads | MH/s | ns/hash | vs sha256d (1T) |
|-----------|--------:|-----:|--------:|----------------:|
| sha256    |       1 |12.45 |    80.3 |           2.00x |
| sha256d   |       1 | 6.21 |   161.0 |           1.00x |
| blake3    |       1 | 9.80 |   102.0 |           1.58x |
| blake3d   |       1 | 5.18 |   193.1 |           0.83x |
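
The MH/s, ns/hash, and speedup columns all follow from the raw h/s figures. The sketch below recomputes them from the single-thread sample values; the names sample_hps and derive are illustrative and do not come from the script.

```python
# Single-thread throughput from the sample output, in hashes per second.
sample_hps = {
    "sha256": 12_450_000,
    "sha256d": 6_210_000,
    "blake3": 9_800_000,
    "blake3d": 5_180_000,
}


def derive(hps, reference_hps):
    """Return (MH/s, ns/hash, speedup versus the reference)."""
    mhs = round(hps / 1e6, 2)
    ns_per_hash = round(1e9 / hps, 1)
    speedup = round(hps / reference_hps, 2)
    return mhs, ns_per_hash, speedup


reference = sample_hps["sha256d"]  # single-threaded SHA-256d is the 1.00x baseline
for algo, hps in sample_hps.items():
    mhs, ns, speedup = derive(hps, reference)
    print(f"{algo:8s} {mhs:6.2f} MH/s {ns:7.1f} ns/hash {speedup:5.2f}x")
```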

4. Interpreting the numbers

  • Single-thread, small inputs: BLAKE3's tree advantage is not visible. Both algorithms are dominated by per-call overhead (Python FFI, hasher setup) at this scale. SHA-256 often comes out ahead here (12.45 vs 9.80 MH/s in the sample output) because Python's hashlib.sha256 is typically a thin wrapper around OpenSSL, with very low per-call overhead.
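  • Multi-thread or larger inputs: BLAKE3's SIMD lanes (4× for SSE2, 8× for AVX2, 16× for AVX-512) extract real parallelism from a single hash by compressing several 1 KiB chunks at once, so that advantage only appears once the input spans multiple chunks; an 80-byte header fits in one. SHA-256 cannot parallelise within one hash because its compression chain is strictly sequential. Adding threads scales both algorithms, since each thread works on independent hashes.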
  • BLAKE3 portable mode: BLAKE3_NO_SIMD=1 disables all SIMD intrinsics, falling back to pure C. Useful as a sanity check that the SIMD output matches the portable output byte for byte (this is what audit-simd-blake3.py verifies).
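  • BLAKE3d vs SHA-256d: the doubled-hash variants both pay roughly 2× the cost of a single hash (6.21 vs 12.45 MH/s for SHA-256 in the sample output). Their ratio therefore stays close to the single-hash ratio (0.83 vs 0.79 in the sample).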
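
As a rough illustration of the doubled-hash point, the sketch below builds a doubled variant from any hashlib-style constructor and compares the two ratios using the sample figures. It assumes the "d" variants mean H(H(data)), as SHA-256d does in Bitcoin; the helper name doubled is hypothetical.

```python
import hashlib


def doubled(hash_ctor):
    """Return a function computing H(H(data)) for a hashlib-style constructor."""
    def run(data):
        return hash_ctor(hash_ctor(data).digest()).digest()
    return run


sha256d = doubled(hashlib.sha256)
# With the blake3 package installed, doubled(blake3.blake3) would give the
# BLAKE3 analogue; treating blake3d as H(H(data)) is an assumption here.

# Single-thread throughput from the sample output, in hashes per second.
sample = {"sha256": 12_450_000, "sha256d": 6_210_000,
          "blake3": 9_800_000, "blake3d": 5_180_000}

single_ratio = sample["blake3"] / sample["sha256"]
doubled_ratio = sample["blake3d"] / sample["sha256d"]
print(f"blake3/sha256   = {single_ratio:.2f}")   # 0.79
print(f"blake3d/sha256d = {doubled_ratio:.2f}")  # 0.83
```

The two ratios are close but not identical, which is what one would expect if the second hash of a doubled variant runs on a 32-byte digest rather than the 80-byte input.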

5. JSON schema

{
  "comparison":  "pow-throughput",
  "host":        {"cpu_model": "...", "cores": 16, ...},
  "timestamp":   "2026-05-13T22:18:43+00:00",
  "rows": [
    {
      "algo": "sha256d", "threads": 1, "input_bytes": 80,
      "duration_s": 3.0, "total_hashes": 18632000,
      "hashes_per_sec": 6210000, "ns_per_hash": 161.0
    },
    ...
  ],
  "summary": {
    "speedup_blake3d_vs_sha256d_single_thread": 0.83,
    "speedup_blake3d_vs_sha256d_multi_thread": 0.91
  }
}
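
A result file with this shape can be sanity-checked by recomputing the derived fields of each row. The sketch below is one way to do that; the function names and the 1% tolerance are assumptions, and only the fields shown in the schema above are used.

```python
import glob
import json
import math


def row_is_consistent(row, rel_tol=0.01):
    """Check that the derived fields agree with the raw counters."""
    rate = row["total_hashes"] / row["duration_s"]
    ns = 1e9 / row["hashes_per_sec"]
    return (math.isclose(rate, row["hashes_per_sec"], rel_tol=rel_tol)
            and math.isclose(ns, row["ns_per_hash"], rel_tol=rel_tol))


def load_rows(pattern="results/pow-throughput-*.json"):
    """Collect the rows of every result file matching the pattern."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            rows.extend(json.load(fh)["rows"])
    return rows


# The example row from the schema above.
sample_row = {
    "algo": "sha256d", "threads": 1, "input_bytes": 80,
    "duration_s": 3.0, "total_hashes": 18632000,
    "hashes_per_sec": 6210000, "ns_per_hash": 161.0,
}
print(row_is_consistent(sample_row))  # True
```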

6. Common pitfalls

  • Thermal throttling: on a laptop without active cooling, throughput can drop by 20-40% mid-run with no warning. Run on a mains-powered desktop or keep --duration short.
  • Other CPU load: close other applications. The benchmark is single-process, but a busy system still leaves it fewer effective cycles.
  • Python overhead: for absolute throughput, prefer the C-based bench in src/bench/. Python results are directionally correct but understate both algorithms because every call pays FFI overhead.
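
One way to spot throttling or competing load is to split a run into short windows and compare their throughput. The sketch below is an assumption about how such a check could look and is not a feature of the benchmark script; window_rates and relative_drop are hypothetical names.

```python
import hashlib
import time


def window_rates(windows=5, window_s=0.2, batch=1000):
    """Measure SHA-256 throughput (h/s) over several consecutive windows."""
    data = bytes(80)
    rates = []
    for _window in range(windows):
        count = 0
        start = time.perf_counter()
        while time.perf_counter() - start < window_s:
            for _ in range(batch):
                hashlib.sha256(data).digest()
            count += batch
        rates.append(count / (time.perf_counter() - start))
    return rates


def relative_drop(rates):
    """Fraction by which the last window is slower than the fastest one."""
    peak = max(rates)
    return (peak - rates[-1]) / peak


if __name__ == "__main__":
    rates = window_rates()
    print(f"drop from peak to last window: {relative_drop(rates):.1%}")
```

A drop that keeps growing across windows points to throttling, while isolated dips are more likely caused by other processes.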

7. Source files