Part 5 of 5 · honest postmortem · UnminerMac v0.20

We built the GPU miner. Then we learned why it doesn't matter.

First publicly known Apple Silicon Metal port of VerusHash 2.2. Pool-accepted shares on the first live test. Hashrate: 0.51 MH/s on the GPU vs 3.88 MH/s on the same M5's CPU. Same story as the RTX 3080.

Verox Studio engineering notes · M5 (10-core, 4P+6E, macOS Tahoe 26.4)

An earlier draft of this post predicted 12–20 MH/s on Metal. We finished the integration. Pool-accepted shares now come out of the Apple M5 GPU. The number is 0.51 MH/s. The original prediction was wrong by more than an order of magnitude (roughly 24–39×). Here's why, and why we're shipping the GPU mode anyway as a research artifact rather than a production speedup.

What we built

v0.20 ships:

First live test: [GPU-SHARE] nonce=c86e47 ... [SHARE ✓] accepted (total: 1). Pool accepted the GPU-found share on the first try. So we know the algorithm is correct end-to-end.

What the numbers actually look like

| Setup | Hashrate | Key memory |
| --- | --- | --- |
| M5 CPU, 7 threads | 3.88 MH/s | ~64 KB |
| M5 GPU, batch=256 | 0.05 MH/s | 6 MB |
| M5 GPU, batch=8192 | 0.51 MH/s | 192 MB |
| M5 GPU, batch=32768 | 0.68 MH/s | 768 MB |
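
The table's figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from the table above, the helper name `summarize` is made up for this example, and it assumes 1 MB = 1024 KB when converting the key-memory column.

```python
# Figures copied from the table above: (batch size, hashrate in MH/s, key memory in MB).
CPU_MHS = 3.88
GPU_RUNS = [(256, 0.05, 6), (8192, 0.51, 192), (32768, 0.68, 768)]


def summarize(cpu_mhs, runs):
    """Return (batch, CPU/GPU slowdown, KB of key memory per nonce) for each GPU run.

    Hypothetical helper for this post; assumes 1 MB = 1024 KB.
    """
    rows = []
    for batch, gpu_mhs, mem_mb in runs:
        slowdown = cpu_mhs / gpu_mhs          # how many times slower than the CPU
        kb_per_nonce = mem_mb * 1024 / batch  # key memory divided across the batch
        rows.append((batch, round(slowdown, 1), kb_per_nonce))
    return rows


SUMMARY = summarize(CPU_MHS, GPU_RUNS)
for batch, slowdown, kb in SUMMARY:
    print(f"batch={batch:>5}: {slowdown:>5.1f}x slower than CPU, "
          f"{kb:.0f} KB key memory per nonce")
```

Under these assumptions the per-nonce key memory works out to 24 KB at every batch size, so memory grows linearly with the batch while the hashrate does not.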

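So at the larger batch sizes the GPU is roughly 6–8× slower than the CPU (and about 78× slower at batch=256), and bigger batches barely help: quadrupling the batch from 8192 to 32768 only lifts the hashrate from 0.51 to 0.68 MH/s. Why?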

The actual ceiling, with arithmetic

The M5 GPU has 10 cores × 128 ALUs = 1,280 ALUs at ~1.5 GHz, peak ~3.85 TFLOPS FP32 (measured roofline analysis). On paper, plenty of compute for mining.
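
The peak figure follows from the core count and clock. A minimal sketch of that arithmetic, assuming the usual convention that each ALU retires one fused multiply-add per cycle and that an FMA counts as two FLOPs (the function name and the `flops_per_alu_per_cycle` parameter are ours, for illustration):

```python
def peak_fp32_tflops(cores, alus_per_core, clock_ghz, flops_per_alu_per_cycle=2):
    """Return (total ALUs, theoretical peak FP32 TFLOPS).

    Assumption: one FMA per ALU per cycle, counted as 2 FLOPs.
    """
    alus = cores * alus_per_core
    # ALUs * FLOPs/cycle * GHz gives GFLOPS; divide by 1000 for TFLOPS.
    tflops = alus * flops_per_alu_per_cycle * clock_ghz / 1000.0
    return alus, tflops


ALUS, PEAK = peak_fp32_tflops(cores=10, alus_per_core=128, clock_ghz=1.5)
print(f"{ALUS} ALUs, ~{PEAK:.2f} TFLOPS FP32 peak")
```

With the clock rounded to 1.5 GHz this gives 3.84 TFLOPS; the ~3.85 quoted above corresponds to a clock a fraction above 1.5 GHz.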

But VerusHash's hot loop (verusclhash) has properties that GPUs structurally hate:

Realistic Metal ceiling on M5 with weeks of further optimization (threadgroup-shared T-tables, smaller per-thread footprint, multi-dispatch pipelining): ~1–2 MH/s. Still below the 3.88 MH/s CPU.

This isn't an Apple Silicon problem. It's a VerusHash problem.

The thing that makes this resigned-shrug funny instead of frustrating: VerusHash has favored CPUs over GPUs on every platform, not just Apple Silicon. From Hashrate.no's VerusHash benchmarks:

| Hardware | VerusHash 2.2 hashrate |
| --- | --- |
| AMD EPYC 9754 (128-core CPU) | 272 MH/s |
| AMD Ryzen 9 7950X (16-core CPU) | 63.97 MH/s |
| Apple M5 (10-core, our baseline) | 3.88 MH/s |
| NVIDIA RTX 3080 (ccminer-verus) | ~14 MH/s |
| NVIDIA GTX 1080 Ti | ~6 MH/s |

The top-ranked benchmarks don't have a single GPU. NVIDIA's RTX 3080 loses to a 16-core consumer CPU by 4–5×. The "RTX 4090 does 80 MH/s on VerusHash" number that floats around in chat is not corroborated by any benchmark site we could find. Crypto Mining Blog wrote in 2019: "GPU mining of VerusHash is uneconomical." Five years and two algorithm revisions later, that's still true.
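
The CPU-to-GPU ratios can be read straight off the table. The sketch below just divides the table's figures; the GPU entries are the approximate "~" values, and the helper name `ratio_table` is ours.

```python
# Hashrates in MH/s, copied from the table above ("~" values taken at face value).
CPUS = {"AMD EPYC 9754": 272.0, "AMD Ryzen 9 7950X": 63.97, "Apple M5": 3.88}
GPUS = {"NVIDIA RTX 3080": 14.0, "NVIDIA GTX 1080 Ti": 6.0}


def ratio_table(cpus, gpus):
    """Return {(cpu, gpu): cpu_hashrate / gpu_hashrate}, rounded to 2 decimals."""
    return {
        (cpu, gpu): round(cpu_mhs / gpu_mhs, 2)
        for cpu, cpu_mhs in cpus.items()
        for gpu, gpu_mhs in gpus.items()
    }


RATIOS = ratio_table(CPUS, GPUS)
for (cpu, gpu), ratio in RATIOS.items():
    print(f"{cpu} vs {gpu}: {ratio:.2f}x")
```

On these figures the Ryzen 9 7950X comes out about 4.57× ahead of the RTX 3080, inside the 4–5× range quoted above. A ratio below 1 means the GPU is ahead in that pairing, which is the case for the 10-core M5.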

It's by design. Verus 2.0's release notes explicitly target CPU mining. RandomX (Monero's algorithm) does the same thing for the same reasons: branchy VM with random scratchpad access, plus operations CPUs do natively and GPUs have to emulate. CPUs and GPUs are fundamentally different machines, and an algorithm that picks features one has and the other doesn't will be lopsided.

What this is good for

If GPU mining VerusHash is a dead-end performance-wise, why ship it? Three reasons:

  1. It's the first one. We searched GitHub, the Verus org, the Apple developer forums, and r/CryptoMining / r/AppleSilicon — nobody has published a Metal port of VerusHash 2.2 before, with or without accepted shares. The kernel + bridge (MIT-licensed in verusminer/metal/) is now the reference any future Apple Silicon mining effort can fork. Maybe someone reading this from the Verus team takes it further with bit-sliced AES; the structural ceiling is ~1–2 MH/s, but our 0.51 is well below that, and the kernel design has obvious places to improve.
  2. It's a correctness artifact. The cryptographic primitives (Haraka256/512, CL hash, AES T-tables, GF(2^128) clmul, mulhrs, precomp reduction) all have validate_*.swift harnesses comparing the GPU output byte-for-byte against the canonical CVerusHashV2 CPU reference. 7 harnesses, ~2,100 test vectors, all green. If you want to study VerusHash on a GPU, this is a clean reference implementation that's known to match the canonical.
  3. It made the CPU faster. Building the GPU port forced us to think hard about what work is per-job (do once, cache) vs per-nonce (must redo). That insight produced the 3.8× CPU speedup covered in part 3 — which is the entire actual revenue improvement of this whole arc. The GPU "failed" at its stated goal but caused the CPU to do its real job better.
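
The per-job vs per-nonce split in point 3 can be illustrated with a toy example. This is not VerusHash and not the miner's actual code: it uses BLAKE2b from Python's standard library as a stand-in, and `hash_naive`, `JobCache`, and `HEADER` are hypothetical names. The post does not say exactly what the real miner caches, so treat this only as a sketch of the general idea: absorb the data that is fixed for a job once, then redo only the nonce-dependent part per attempt. It also checks that the cached path matches the naive path byte-for-byte, in the spirit of the validation harnesses in point 2.

```python
import hashlib


def hash_naive(header: bytes, nonce: int) -> bytes:
    """Per-nonce path: re-absorbs the whole header on every attempt."""
    h = hashlib.blake2b(digest_size=32)
    h.update(header)
    h.update(nonce.to_bytes(4, "little"))
    return h.digest()


class JobCache:
    """Per-job path: absorb the header once, reuse the saved state per nonce."""

    def __init__(self, header: bytes):
        # Done once per job; the hash state after the header is the cached part.
        self._midstate = hashlib.blake2b(header, digest_size=32)

    def hash(self, nonce: int) -> bytes:
        h = self._midstate.copy()              # cheap copy of the cached state
        h.update(nonce.to_bytes(4, "little"))  # only nonce-dependent work remains
        return h.digest()


# Hypothetical job data, long enough that re-absorbing it is the dominant cost.
HEADER = b"hypothetical-job-header" * 64
JOB = JobCache(HEADER)

# Byte-for-byte check that the optimized path matches the reference path.
MATCHES = all(JOB.hash(n) == hash_naive(HEADER, n) for n in range(1000))
print("cached path matches naive path:", MATCHES)
```

The point of the design is that correctness is checked against the slow path before the fast path is trusted, which is the same order of operations the GPU port followed.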

What we're not going to do

Build a bit-sliced AES kernel. Spend three weeks chasing a theoretical 8–12 MH/s on a path that might match the CPU. Solo, the ROI is bad. The honest answer is that VerusHash on Apple Silicon GPU is a research-grade demonstrator, the CPU is the production miner, and the next optimization-effort dollar is better spent on the CPU side (P-core pinning, dual-mining VerusHash on P-cores + RandomX on E-cores, etc.) than on the GPU.

If you're working on Apple Silicon Metal mining and disagree — open an issue, fork the kernel, prove us wrong. Honestly, that would be the best outcome of shipping this.

Where the code lives: