Part 4 of 5 · GPU primitives · UnminerMac v0.19

Five Metal primitives, byte-perfect

Porting Haraka256/512, clmul64, mulhrs, and precompReduction64 to Metal Shading Language, validated against ~2,100 test vectors. T-table AES, the carryless multiply Metal doesn't have, and the GF(2⁶⁴) reduction that closes the pipeline.

Verox Studio engineering notes · M5 (10-core, 4P+6E, macOS Tahoe 26.4)

With CPU mining working and accepting shares, the next question was GPU acceleration. Apple Silicon GPUs are interesting for crypto mining because they share unified memory with the CPU (no PCIe roundtrip), they have a lot of compute (~3.85 TFLOPS on M5), and almost no one has tried to use them for proof-of-work because Apple doesn't sell to miners.

The plan: port VerusHash 2.2's five cryptographic building blocks to Metal Shading Language, validate each one byte-perfect against the CPU reference, then integrate them into a single mining kernel.

haraka256 — the explicit AES round that wouldn't validate

Haraka is a low-latency permutation built from AES rounds. haraka256 processes 32-byte inputs through 5 outer rounds, each containing 4 AES sub-rounds. We wrote a "by hand" Metal kernel that implemented SubBytes + ShiftRows + MixColumns + AddRoundKey as separate operations. It produced wrong output.

We debugged for hours. We thought the round constants were in the wrong byte order. They weren't. We thought ShiftRows was rotating the wrong direction. It wasn't. We thought MixColumns had a sign-extension bug in the gf2 doubling. It didn't. Every individual operation tested correctly in isolation, but composed in the round they didn't.

The eventual fix was to throw away the explicit round implementation and adopt the T-table form from the canonical CPU reference. T-tables fuse SubBytes, ShiftRows, and MixColumns into a single 256-entry lookup keyed by the input byte and the output row. The CPU reference uses them because they're fast; we adopted them because there's one fused operation to get right instead of three separate ones to compose.

// One AES round in T-table form: 16 lookups + 12 XORs + 4 RK XORs.
inline void aesenc_tt(thread uint *s, constant uint *rk) {
    uint x0 = s[0], x1 = s[1], x2 = s[2], x3 = s[3];
    uint y0, y1, y2, y3;

    y0 = t_row((uchar)( x0        & 0xff), 0);
    y1 = t_row((uchar)( x1        & 0xff), 0);
    // ... 14 more lookups for rows 1, 2, 3 with shift ...

    s[0] = y0 ^ rk[0]; s[1] = y1 ^ rk[1];
    s[2] = y2 ^ rk[2]; s[3] = y3 ^ rk[3];
}

Byte-perfect on the first try. Test vector from the Haraka v2 paper (8027ccb87949774b…) matched on the GPU.

haraka512 — same template, twice as wide

With the T-table approach validated for 256-bit, scaling to 512-bit was a clean port: 4 halves of 16 bytes instead of 2, 4 round constants per sub-round instead of 2, the MIX4 mixing layer instead of MIX2, and a 32-byte TRUNCSTORE output instead of 32-byte feed-forward.

Validated against the paper test vector be7f723b4e80a998… first try.

clmul64 — the carryless multiply that Metal doesn't have

This is the part that seemed hard before we wrote it. verusclhash is built almost entirely around 64×64-bit carryless multiplication — multiplying two 64-bit polynomials over GF(2) to produce a 128-bit polynomial. On x86 this is one instruction (_mm_clmulepi64_si128). On ARM it's vmull_p64. On Metal, it doesn't exist as a primitive.

The CPU reference includes a portable software implementation using the 4-bit window technique: precompute 16 entries (u[k] = k·b for k in 0..15), then unroll the multiplication as 16 lookups + XORs over windows of a. The total cost is roughly 16 GF doublings + 16 XORs to build the table, then 16 shifted lookups for the multiply, plus a small "repair" pass for high-bit carries.

This ported cleanly to MSL with one caveat: uint64_t left-shifts past 63 are undefined behavior in MSL just like in C. The loop bounds in the algorithm keep (64 - i) in the range [4, 60], so we're safe — but the kernel code documents it.

We validated 1032 (a, b) pairs:

8 hand-picked edge cases: (0,0), (1,1), (max,max), (alt,alt), (lsb,max), (msb,msb), and two ascending/descending hex words
1024 pseudo-random pairs from xorshift64* (deterministic — failures would be bit-reproducible)

All 1032 byte-perfect. The interesting edge cases:

max × max → 0x5555…5555 / 0x5555…5555 — alternating bit pattern is a known GF(2) identity for all-ones inputs
msb × msb → 0 / 0x4000…0000 — x⁶³ × x⁶³ = x¹²⁶, not x¹²⁷, which is the test you'd accidentally fail if you mixed up GF poly mul with regular integer mul

mulhrs_epi16 and precompReduction64 — the closing primitives

Two more pieces to make verusclhash complete:

mulhrs_epi16 — Intel SSSE3's "packed signed 16-bit multiply, high half, rounded". 8-lane SIMD, lane-independent, ~5 lines: r[i] = (int16_t)((a[i] * b[i] + 0x4000) >> 15). Validated against 517 vectors (5 edge + 512 random).
precompReduction64 — the GF reduction that closes the verusclhash pipeline, taking a 128-bit accumulator down to 64 bits under the polynomial x⁶⁴ + x⁴ + x³ + x + 1. Reuses the same clmul64 we already validated, plus a 16-byte lookup table and two XORs. 518 vectors passed.

Status of the Metal primitives

Primitive	Test vectors	Status
haraka256	4	✓ byte-perfect
haraka512	4 (incl. paper)	✓ byte-perfect
haraka512_keyed	9	✓ byte-perfect
clmul64	1032	✓ byte-perfect
mulhrs_epi16	517	✓ byte-perfect
precompReduction64	518	✓ byte-perfect
Total	~2100	all pass

Every primitive validated. The remaining work is the 32-iteration selector loop at the heart of verusclhash that ties them together, plus integration into an actual mining loop. Part 5 covers what happened when we finished that integration — and what the honest performance result taught us about GPUs and mining algorithms.