1. What is measured
Each algorithm hashes a fixed 80-byte input (Bitcoin block-header size) in a tight loop for N seconds. The script records:
- hashes_per_sec
- ns_per_hash
- thread count and host CPU model (for reproducibility)
- speedup factor versus the SHA-256d reference
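The measurement described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `measure`, the `batch` parameter, and the short default duration are assumptions made for the example, and only SHA-256d is shown so the block depends on nothing beyond the standard library.

```python
import hashlib
import time

HEADER_SIZE = 80  # bytes, the size of a Bitcoin block header


def sha256d(data):
    """Double SHA-256, the reference algorithm."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def measure(hash_fn, duration_s=0.2, batch=1000):
    """Hash a fixed 80-byte input in a tight loop for about duration_s seconds.

    `measure` and `batch` are hypothetical names used for this sketch.
    """
    data = bytes(HEADER_SIZE)
    total = 0
    start = time.perf_counter()
    deadline = start + duration_s
    while time.perf_counter() < deadline:
        # Hash in batches so the clock is not read on every iteration.
        for _ in range(batch):
            hash_fn(data)
        total += batch
    elapsed = time.perf_counter() - start
    hashes_per_sec = total / elapsed
    return {
        "total_hashes": total,
        "hashes_per_sec": hashes_per_sec,
        "ns_per_hash": 1e9 / hashes_per_sec,
    }


result = measure(sha256d)
print(f"{result['hashes_per_sec']:,.0f} h/s ({result['ns_per_hash']:.1f} ns/hash)")
```

Reading the clock once per batch rather than once per hash keeps timer overhead out of the measured loop; the batch size trades timing resolution against that overhead.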
2. How to run
cd b3chain
pip3 install blake3
python3 contrib/testing/compare/compare-pow-throughput.py
# customise
python3 contrib/testing/compare/compare-pow-throughput.py \
--duration 5 \
--threads 1,4,8
# omit the BLAKE3-portable (no-SIMD) measurement
python3 contrib/testing/compare/compare-pow-throughput.py --no-portable
3. Sample output
Measuring hash throughput (input=80B, duration=3.0s)
sha256   threads=1 -> 12,450,000 h/s (  80.3 ns/hash)
sha256d  threads=1 ->  6,210,000 h/s ( 161.0 ns/hash)
blake3   threads=1 ->  9,800,000 h/s ( 102.0 ns/hash)
blake3d  threads=1 ->  5,180,000 h/s ( 193.1 ns/hash)
sha256   threads=8 -> 91,600,000 h/s (  10.9 ns/hash)
blake3   threads=8 -> 74,300,000 h/s (  13.5 ns/hash)
[no-simd] blake3 threads=8 -> 19,400,000 h/s (  51.5 ns/hash)

| algorithm | threads |  MH/s | ns/hash | vs sha256d (1T) |
|-----------|--------:|------:|--------:|----------------:|
| sha256    |       1 | 12.45 |    80.3 |           2.00x |
| sha256d   |       1 |  6.21 |   161.0 |           1.00x |
| blake3    |       1 |  9.80 |   102.0 |           1.58x |
| blake3d   |       1 |  5.18 |   193.1 |           0.83x |
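The derived columns in the sample follow directly from the raw throughput: ns/hash is 1e9 divided by hashes per second, and the speedup is each algorithm's throughput divided by the single-thread SHA-256d throughput. The sketch below recomputes them from the single-thread sample figures; the helper name `derive` is hypothetical.

```python
# Single-thread throughput (hashes/s) copied from the sample output.
sample = {
    "sha256": 12_450_000,
    "sha256d": 6_210_000,
    "blake3": 9_800_000,
    "blake3d": 5_180_000,
}


def derive(hashes_per_sec, ref_hashes_per_sec):
    """Compute the table columns for one row (hypothetical helper)."""
    return {
        "mh_s": hashes_per_sec / 1e6,
        "ns_per_hash": 1e9 / hashes_per_sec,
        "speedup": hashes_per_sec / ref_hashes_per_sec,
    }


rows = {algo: derive(hps, sample["sha256d"]) for algo, hps in sample.items()}

for algo, row in rows.items():
    print(f"{algo:8s} {row['mh_s']:6.2f} MH/s "
          f"{row['ns_per_hash']:6.1f} ns/hash {row['speedup']:.2f}x")
```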
4. Interpreting the numbers
- Single-thread, small inputs: BLAKE3's tree advantage is not visible. Both algorithms are dominated by per-call overhead (Python FFI, instruction startup) at this scale. SHA-256 is sometimes faster here because Python's hashlib.sha256 is a thin wrapper around OpenSSL, with very low per-call overhead.
- Multi-thread or larger inputs: BLAKE3's SIMD lanes (4× for SSE2, 8× for AVX2, 16× for AVX-512) extract real parallelism from a single hash, but only once the input spans several 1024-byte chunks; an 80-byte header fits in one chunk. SHA-256 cannot parallelise within one hash because its compression chain is strictly sequential.
- BLAKE3 portable mode: BLAKE3_NO_SIMD=1 disables all SIMD intrinsics, falling back to pure C. Useful as a sanity check that the SIMD output matches the portable output byte-for-byte (this is what audit-simd-blake3.py verifies).
- BLAKE3d vs SHA-256d: the doubled-hash variants both pay roughly 2× the cost, so their ratio stays close to the single-hash ratio (0.83x vs 0.79x in the sample above).
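To see why an 80-byte input leaves little room for intra-hash parallelism, it helps to count the work units involved. The sketch below assumes the standard parameters from the algorithm specifications rather than anything in this repository: 64-byte compression blocks for both algorithms, SHA-256 padding of one marker byte plus an 8-byte length, and 1024-byte BLAKE3 chunks. The function names are hypothetical.

```python
BLOCK_LEN = 64           # bytes per compression block (SHA-256 and BLAKE3)
BLAKE3_CHUNK_LEN = 1024  # bytes per BLAKE3 chunk, the unit hashed in parallel


def sha256_blocks(n):
    """Sequential 64-byte blocks SHA-256 processes for an n-byte message.

    Padding adds one 0x80 byte and an 8-byte length, rounded up to 64.
    """
    return (n + 9 + BLOCK_LEN - 1) // BLOCK_LEN


def blake3_chunks(n):
    """Chunks an n-byte input is split into (the empty input still uses one)."""
    return max(1, -(-n // BLAKE3_CHUNK_LEN))


for n in (32, 80, 1024, 8192):
    print(f"{n:5d} bytes: sha256 blocks={sha256_blocks(n):3d}, "
          f"blake3 chunks={blake3_chunks(n)}")
```

With a single chunk there is nothing for the SIMD lanes to process side by side, which is consistent with the single-thread results for 80-byte inputs. The same count shows that SHA-256d on a header costs three compression calls (two for the 80-byte input, one for the 32-byte digest), not exactly twice a single hash.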
5. JSON schema
{
"comparison": "pow-throughput",
"host": {"cpu_model": "...", "cores": 16, ...},
"timestamp": "2026-05-13T22:18:43+00:00",
"rows": [
{
"algo": "sha256d", "threads": 1, "input_bytes": 80,
"duration_s": 3.0, "total_hashes": 18632000,
"hashes_per_sec": 6210000, "ns_per_hash": 161.0
},
...
],
"summary": {
"speedup_blake3d_vs_sha256d_single_thread": 0.83,
"speedup_blake3d_vs_sha256d_multi_thread": 0.91
}
}
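A consumer of this report might recompute the summary from the rows as a consistency check. The sketch below is one way to do that, assuming the field names shown in the schema; the embedded report is illustrative (the blake3d row is filled in from the sample throughput, and its total_hashes value is made up for the example), and the helper names are hypothetical.

```python
import json

# Illustrative report with two rows shaped like the schema.
REPORT_JSON = """
{
  "comparison": "pow-throughput",
  "rows": [
    {"algo": "sha256d", "threads": 1, "input_bytes": 80,
     "duration_s": 3.0, "total_hashes": 18632000,
     "hashes_per_sec": 6210000, "ns_per_hash": 161.0},
    {"algo": "blake3d", "threads": 1, "input_bytes": 80,
     "duration_s": 3.0, "total_hashes": 15540000,
     "hashes_per_sec": 5180000, "ns_per_hash": 193.1}
  ]
}
"""


def throughput(report, algo, threads):
    """Return hashes_per_sec for the row matching algo and thread count."""
    for row in report["rows"]:
        if row["algo"] == algo and row["threads"] == threads:
            return row["hashes_per_sec"]
    raise KeyError(f"no row for algo={algo!r}, threads={threads}")


def speedup(report, algo, ref, threads):
    """Throughput of algo relative to ref at the same thread count."""
    return throughput(report, algo, threads) / throughput(report, ref, threads)


report = json.loads(REPORT_JSON)
single = speedup(report, "blake3d", "sha256d", threads=1)
print(f"blake3d vs sha256d (1 thread): {single:.2f}x")
```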
6. Common pitfalls
- Thermal throttling: running the benchmark on a laptop with no active cooling can silently drop throughput by 20-40% mid-run. A mains-powered desktop or a short --duration is recommended.
- Other CPU load: close other applications. The benchmark is single-process but a busy system reduces effective cycles.
- Python overhead: for absolute throughput, prefer the C-based bench in src/bench/. Python results are directionally correct but understate both algorithms because of the FFI overhead.
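One way to notice throttling or interfering load is to split a run into equal time slices and compare early and late throughput. The sketch below is not part of the benchmark script; the function name, the 10% threshold, and the per-slice counts are all assumptions chosen for illustration.

```python
def throughput_drop(slice_counts):
    """Fractional drop from the first half of the run to the second half.

    slice_counts holds the number of hashes completed in each equal-length
    time slice. A positive result means the run slowed down.
    """
    if len(slice_counts) < 2:
        raise ValueError("need at least two slices")
    half = len(slice_counts) // 2
    first = sum(slice_counts[:half]) / half
    second = sum(slice_counts[-half:]) / half
    return 1.0 - second / first


# Synthetic per-slice counts, not measured data.
counts = [6_200_000, 6_210_000, 4_400_000, 4_350_000]
drop = throughput_drop(counts)
throttled = drop > 0.10  # hypothetical threshold

print(f"drop={drop:.1%}, throttled={throttled}")
```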