We built the GPU miner. Then we learned why it doesn't matter.
First publicly known Apple Silicon Metal port of VerusHash 2.2. Pool-accepted shares on the first live test. Hashrate: 0.51 MH/s vs 3.88 MH/s on the same M5's CPU. Same story as the RTX 3080: the GPU loses to the CPU.
An earlier draft of this post predicted 12–20 MH/s on Metal. We finished the integration, and pool-accepted shares now come out of the Apple M5 GPU. The number is 0.51 MH/s, so the original prediction was off by a factor of roughly 24–39, well over an order of magnitude. Here's why, and why we're shipping the GPU mode anyway as a research artifact rather than a production speedup.
What we built
v0.20 ships:
- A Metal compute kernel (`verus_mine_kernel`) that does the full Reset + Write + Finalize2b pipeline per GPU thread, mutates the 8-byte nonce slot on the fly, and compares the resulting hash against the pool target on the GPU. Winners are atomically logged.
- An Objective-C++ bridge (`metal_bridge.mm`) the C++ miner calls via a small C-compatible API.
- A `--gpu` flag + `gpu_worker_loop` in the miner with trust-but-verify: the first 16 winners are recomputed on CPU and compared before submission. 3 mismatches → abort.
- UI toggle "Use Metal GPU (experimental)" in Settings.
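The trust-but-verify rule is easy to model in isolation. The following is a minimal Python sketch of that policy, not the miner's C++ code: `WinnerVerifier`, `should_submit`, and the `cpu_hash` callable are hypothetical names, and two details are assumptions the post does not spell out, namely that a mismatching winner is dropped rather than submitted, and that the three mismatches are counted across the verification window rather than consecutively.

```python
from dataclasses import dataclass
from typing import Callable

VERIFY_FIRST_N = 16   # from the post: the first 16 winners are rechecked on CPU
MAX_MISMATCHES = 3    # from the post: 3 mismatches abort GPU mode


class GpuMismatchAbort(RuntimeError):
    """Raised when the GPU disagrees with the CPU reference too often."""


@dataclass
class WinnerVerifier:
    # Hypothetical CPU reference: (header, nonce) -> hash bytes.
    cpu_hash: Callable[[bytes, int], bytes]
    checked: int = 0
    mismatches: int = 0

    def should_submit(self, header: bytes, nonce: int, gpu_hash: bytes) -> bool:
        """Return True if the GPU winner may be submitted to the pool."""
        if self.checked >= VERIFY_FIRST_N:
            return True  # verification window is over; trust the GPU
        self.checked += 1
        if self.cpu_hash(header, nonce) == gpu_hash:
            return True
        # Assumption: a winner the CPU cannot reproduce is never submitted.
        self.mismatches += 1
        if self.mismatches >= MAX_MISMATCHES:
            raise GpuMismatchAbort(
                f"{self.mismatches} GPU/CPU mismatches in first {self.checked} winners"
            )
        return False
```

The point of the design is that a broken kernel costs at most a handful of dropped candidates, never a stream of rejected shares at the pool.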
First live test: [GPU-SHARE] nonce=c86e47 ... [SHARE ✓] accepted (total: 1). Pool accepted the GPU-found share on the first try. So we know the algorithm is correct end-to-end.
What the numbers actually look like
| Setup | Hashrate | Key memory |
|---|---|---|
| M5 CPU, 7 threads | 3.88 MH/s | ~64 KB |
| M5 GPU, batch=256 | 0.05 MH/s | 6 MB |
| M5 GPU, batch=8192 | 0.51 MH/s | 192 MB |
| M5 GPU, batch=32768 | 0.68 MH/s | 768 MB |
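So at the larger batch sizes the GPU is roughly 6–8× slower than the CPU, and bigger batches barely help: quadrupling the batch from 8192 to 32768 raises throughput by only about a third. Why?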
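For readers who want to check the table, here is a small Python sketch of the arithmetic behind it. The figures are copied from the rows above; the function names are ours, and reading "MB" as MiB is an assumption.

```python
CPU_MHS = 3.88  # M5 CPU, 7 threads (from the table)

# (batch size, GPU hashrate in MH/s, key memory in MB) copied from the table.
GPU_ROWS = [
    (256, 0.05, 6),
    (8192, 0.51, 192),
    (32768, 0.68, 768),
]


def slowdown(gpu_mhs: float) -> float:
    """How many times slower the GPU is than the CPU baseline."""
    return CPU_MHS / gpu_mhs


def kib_per_hash(batch: int, key_mib: float) -> float:
    """Key memory per in-flight hash, assuming the table's MB are MiB."""
    return key_mib * 1024 / batch


for batch, mhs, mib in GPU_ROWS:
    print(f"batch={batch:>5}: {slowdown(mhs):5.1f}x slower than CPU, "
          f"{kib_per_hash(batch, mib):.0f} KiB of key memory per hash")

# Scaling from batch=8192 to batch=32768: 4x the work in flight.
batch_growth = 32768 / 8192
rate_growth = 0.68 / 0.51
print(f"{batch_growth:.0f}x the batch buys {rate_growth:.2f}x the hashrate")
```

Every GPU row works out to the same per-hash key footprint, so memory grows linearly with the batch while the hashrate flattens out.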
The actual ceiling, with arithmetic
The M5 GPU has 10 cores × 128 ALUs = 1,280 ALUs at ~1.5 GHz, peak ~3.85 TFLOPS FP32 (measured roofline analysis). On paper, plenty of compute for mining.
But VerusHash's hot loop (verusclhash) has properties that GPUs structurally hate:
- An 8-way `switch` per iteration, 32 iterations per hash. Threads in the same 32-lane SIMD warp take different branches on different selector values. The hardware serializes the divergent lanes. Apple's own WWDC20 Metal guidance explicitly warns about this: the penalty is up to 8× on heavy divergence.
- 24 KB of per-hash state (the CL hash key). M5 threadgroup memory is 32 KB/core. That means at most ~1 hash in flight per threadgroup, and with the GPU's register-file constraints, you get maybe ~40 concurrent hashes total across all 10 cores. The thousands of threads you'd need to hide GPU latency simply do not fit.
- No native carryless multiply. CPUs have `PCLMULQDQ` (Intel) and `PMULL` (ARMv8). Apple Silicon GPUs do not. We emulate it with shifts/XORs at ~10× the cost of the hardware instruction.
- Zero intra-hash parallelism. A single VerusHash is a serial dependency chain: you can't SIMD-vectorize within one hash. All parallelism comes from running many independent hashes simultaneously, which collides with the memory constraint above.
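Two of the points above can be made concrete with a few lines of Python. This is an illustrative model, not code from the kernel: `expected_distinct_branches` assumes every lane picks its switch case independently and uniformly at random, which the post does not claim, and `clmul64` is the textbook shift/XOR form of a carryless multiply, which may differ from what `verus_hash_v2.metal` actually does.

```python
def expected_distinct_branches(cases: int = 8, lanes: int = 32) -> float:
    """Expected number of distinct switch cases hit by one SIMD group.

    Assumption: each lane chooses a case independently and uniformly.
    A group that hits k distinct cases executes k branch bodies serially.
    """
    p_case_unused = ((cases - 1) / cases) ** lanes
    return cases * (1.0 - p_case_unused)


MASK64 = (1 << 64) - 1


def clmul64(a: int, b: int) -> int:
    """Carryless multiply of two 64-bit values (polynomials over GF(2)).

    Same idea as ordinary long multiplication, but partial products are
    combined with XOR instead of addition, so no carries propagate.
    The result can be up to 127 bits wide.
    """
    a &= MASK64
    b &= MASK64
    product = 0
    for bit in range(64):
        if (b >> bit) & 1:
            product ^= a << bit
    return product


if __name__ == "__main__":
    print(f"expected distinct cases per SIMD group: "
          f"{expected_distinct_branches():.2f} of 8")
    print(f"clmul64(0b11, 0b11) = {clmul64(0b11, 0b11):#b}")
```

Under the uniform assumption a 32-lane group touches about 7.9 of the 8 cases on average, so nearly every iteration pays close to the full serialization penalty. The multiply loop shows why emulation is expensive: one hardware instruction on the CPU becomes up to 64 conditional shift-and-XOR steps.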
Realistic Metal ceiling on M5 with weeks of further optimization (threadgroup-shared T-tables, smaller per-thread footprint, multi-dispatch pipelining): ~1–2 MH/s. Still below the 3.88 MH/s CPU.
This isn't an Apple Silicon problem. It's a VerusHash problem.
The thing that makes this resigned-shrug funny instead of frustrating: VerusHash has favored CPUs over GPUs on every platform we could find numbers for. From Hashrate.no's VerusHash benchmarks:
| Hardware | VerusHash 2.2 |
|---|---|
| AMD EPYC 9754 (128-core CPU) | 272 MH/s |
| AMD Ryzen 9 7950X (16-core CPU) | 63.97 MH/s |
| Apple M5 (10-core, our baseline) | 3.88 MH/s |
| NVIDIA RTX 3080 (ccminer-verus) | ~14 MH/s |
| NVIDIA GTX 1080 Ti | ~6 MH/s |
The top-ranked benchmarks don't have a single GPU. NVIDIA's RTX 3080 loses to a 16-core consumer CPU by 4–5×. The "RTX 4090 does 80 MH/s on VerusHash" number that floats around in chat is not corroborated by any benchmark site we could find. Crypto Mining Blog wrote in 2019: "GPU mining of VerusHash is uneconomical." Five years and two algorithm revisions later, that's still true.
It's by design. Verus 2.0's release notes explicitly target CPU mining. RandomX (Monero's algorithm) does the same thing for the same reasons: branchy VM with random scratchpad access, plus operations CPUs do natively and GPUs have to emulate. CPUs and GPUs are fundamentally different machines, and an algorithm that picks features one has and the other doesn't will be lopsided.
What this is good for
If GPU mining VerusHash is a dead-end performance-wise, why ship it? Three reasons:
- It's the first one. We searched GitHub, the Verus org, the Apple developer forums, and r/CryptoMining / r/AppleSilicon — nobody has published a Metal port of VerusHash 2.2 before, with or without accepted shares. The kernel + bridge (MIT-licensed in verusminer/metal/) is now the reference any future Apple Silicon mining effort can fork. Maybe someone reading this from the Verus team takes it further with bit-sliced AES; the structural ceiling is ~1–2 MH/s, but our 0.51 is well below that, and the kernel design has obvious places to improve.
- It's a correctness artifact. The cryptographic primitives (Haraka256/512, CL hash, AES T-tables, GF(2^128) clmul, mulhrs, precomp reduction) all have `validate_*.swift` harnesses comparing the GPU output byte-for-byte against the canonical `CVerusHashV2` CPU reference. 7 harnesses, ~2,100 test vectors, all green. If you want to study VerusHash on a GPU, this is a clean reference implementation that's known to match the canonical one.
- It made the CPU faster. Building the GPU port forced us to think hard about what work is per-job (do once, cache) vs per-nonce (must redo). That insight produced the 3.8× CPU speedup covered in part 3, which is the entire actual revenue improvement of this whole arc. The GPU "failed" at its stated goal but caused the CPU to do its real job better.
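The per-job vs per-nonce split is a general pattern, and a toy version shows the shape of it. In this Python sketch SHA-256 stands in for the hash (it is emphatically not VerusHash, whose actual split is covered in part 3, not here); `JobMidstate`, `hash_naive`, and the little-endian 8-byte nonce encoding are all hypothetical choices made for illustration.

```python
import hashlib
import struct


def hash_naive(job_prefix: bytes, nonce: int) -> bytes:
    """Recompute everything for every nonce (the slow baseline)."""
    return hashlib.sha256(job_prefix + struct.pack("<Q", nonce)).digest()


class JobMidstate:
    """Absorb the per-job bytes once, then only redo the per-nonce tail."""

    def __init__(self, job_prefix: bytes) -> None:
        # Per-job work: done once when the pool sends a new job, then cached.
        self._midstate = hashlib.sha256(job_prefix)

    def hash_nonce(self, nonce: int) -> bytes:
        # Per-nonce work: copy the cached state and absorb only the 8-byte nonce.
        h = self._midstate.copy()
        h.update(struct.pack("<Q", nonce))
        return h.digest()


if __name__ == "__main__":
    prefix = b"\x00" * 1024  # stand-in for the fixed part of a job
    job = JobMidstate(prefix)
    assert all(job.hash_nonce(n) == hash_naive(prefix, n) for n in range(1000))
    print("cached midstate matches the naive hash for 1000 nonces")
```

The cached path produces the same digests while touching only the bytes that change between nonces, which is the same kind of saving the paragraph above describes.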
What we're not going to do
Build a bit-sliced AES kernel. Spend three weeks chasing a theoretical 8–12 MH/s on a path that might match the CPU. Solo, the ROI is bad. The honest answer is that VerusHash on Apple Silicon GPU is a research-grade demonstrator, the CPU is the production miner, and the next optimization-effort dollar is better spent on the CPU side (P-core pinning, dual-mining VerusHash on P-cores + RandomX on E-cores, etc.) than on the GPU.
If you're working on Apple Silicon Metal mining and disagree — open an issue, fork the kernel, prove us wrong. Honestly, that would be the best outcome of shipping this.
Where the code lives:
- Kernel: verusminer/metal/verus_hash_v2.metal
- Bridge: verusminer/metal/metal_bridge.mm
- v0.20.0 release with the GPU toggle: github releases
- Upstream offer to the Verus team: VerusCoin/VerusCoin#626