A 5-post series. Read in order or jump in anywhere. The arc: get any share accepted → make it fast → port it to the GPU → learn why that doesn't help. Code is MIT-licensed, links throughout.
Why every share was rejected
The pool answered "low difficulty share" on every submission. The algorithm was right; the preprocessing wasn't. Three days of hunting led to a single missing blake2b call with a specific personalization string.
Reverse-engineering LuckPool's hash pipeline
VerusHash on PBaaS chains doesn't hash the raw block header. It hashes a canonically-cleared version that has a specific blake2b("VerusDefaultHash", …) digest embedded at a specific offset. Here's how we found that out.
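As a rough illustration of the preprocessing described above, a personalized blake2b digest can be computed with Python's standard `hashlib` and spliced into a header buffer. Only the personalization string comes from the text. The 32-byte digest size, the offset, the `payload` argument, and the helper names are hypothetical stand-ins; the real hash input and offset are covered in the post itself.

```python
import hashlib

# From the text: the blake2b personalization string (exactly 16 bytes,
# which is the maximum blake2b allows).
PERSONAL = b"VerusDefaultHash"

# Hypothetical values for illustration only; not the real layout.
DIGEST_SIZE = 32
DIGEST_OFFSET = 4


def default_hash(payload: bytes) -> bytes:
    """Personalized blake2b over `payload` (what is hashed is an assumption)."""
    return hashlib.blake2b(payload, digest_size=DIGEST_SIZE, person=PERSONAL).digest()


def embed_digest(cleared_header: bytes, payload: bytes) -> bytes:
    """Return a copy of the cleared header with the digest written at the offset."""
    out = bytearray(cleared_header)
    out[DIGEST_OFFSET:DIGEST_OFFSET + DIGEST_SIZE] = default_hash(payload)
    return bytes(out)
```

The point of the sketch is that personalization changes the digest: the same bytes hashed without `person=` give a different result, which is the kind of mismatch a pool would reject.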
The 3.8× speedup hiding in the inner loop
Once shares were accepting, the miner ran at 1.04 MH/s. Per-job work was being redone per-nonce. Splitting "what changes every iteration" from "what's constant for the duration" took the same M5 from 1.04 to 3.94 MH/s.
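The general shape of that optimization can be sketched with a stand-in hash. The example below uses blake2b from Python's `hashlib`, not VerusHash, and the names `job_prefix`, `hash_naive`, and `make_job_hasher` are hypothetical. It only shows the pattern of absorbing the per-job constant data once and reusing that state for every nonce.

```python
import hashlib
import struct


def hash_naive(job_prefix: bytes, nonce: int) -> bytes:
    """Redoes the constant per-job work on every nonce."""
    h = hashlib.blake2b(digest_size=32)
    h.update(job_prefix)                 # constant for the whole job
    h.update(struct.pack("<I", nonce))   # the only part that changes
    return h.digest()


def make_job_hasher(job_prefix: bytes):
    """Absorb the constant part once, then reuse it for each nonce."""
    midstate = hashlib.blake2b(digest_size=32)
    midstate.update(job_prefix)          # done once per job

    def hash_nonce(nonce: int) -> bytes:
        h = midstate.copy()              # cheap copy of the prepared state
        h.update(struct.pack("<I", nonce))
        return h.digest()

    return hash_nonce
```

Both paths produce identical digests; the second simply moves the job-constant work out of the inner loop. How much that saves depends on how large the constant part is relative to the per-nonce part.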
Read part 3 →
Five Metal primitives, byte-perfect
Porting Haraka256/512, clmul64, mulhrs, and precompReduction64 to Metal Shading Language, validated against ~2,100 test vectors. T-table AES, the carryless multiply Metal doesn't have, and the GF(2^64) reduction that closes the pipeline.
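For readers unfamiliar with carryless multiplication, here is a minimal bit-serial reference in Python. It is a sketch of the operation itself (polynomial multiplication over GF(2), where partial products are combined with XOR instead of addition), not the Metal code from the post. The function name mirrors the primitive named above, but the signature and the (low, high) return convention are assumptions.

```python
MASK64 = (1 << 64) - 1


def clmul64(a: int, b: int) -> tuple:
    """Carryless 64x64 -> 128-bit multiply; returns (low64, high64)."""
    a &= MASK64
    b &= MASK64
    acc = 0
    for i in range(64):
        if (b >> i) & 1:
            acc ^= a << i        # XOR instead of add: no carries propagate
    return acc & MASK64, acc >> 64


# (x + 1) * (x + 1) = x^2 + 1 over GF(2), so 3 clmul 3 is 5, not 9.
print(clmul64(3, 3))  # (5, 0)
```

A slow reference like this is one way to generate test vectors for a GPU port, since it is easy to check by hand against small polynomials.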
Read part 4 →
We built the GPU miner. Then we learned why it doesn't matter.
First publicly known Apple Silicon Metal port of VerusHash 2.2. Pool-accepted shares on the first live test. Hashrate: 0.51 MH/s on the GPU vs 3.88 MH/s on the same M5's CPU. Same story as the RTX 3080. The honest performance story, the M5 GPU spec dive, and why we're shipping it anyway.
Read part 5 →