Part 3 of 5 · performance · UnminerMac v0.19

The 3.8× speedup hiding in the inner loop

Once shares were being accepted, the miner ran at 1.04 MH/s because per-job work was being redone for every nonce. Splitting "what changes per iteration" from "what's constant for the duration of a job" took the same M5 from 1.04 to 3.94 MH/s.

Verox Studio engineering notes · M5 (10-core, 4P+6E, macOS Tahoe 26.4)

With shares accepting, we started actually mining. The reported hashrate was 1.04 MH/s on 7 threads. That's terrible — about 150 KH/s per thread.

The v0.18 release notes had reported 5.77 MH/s on 4 threads (~1.44 MH/s/thread). What had we lost?

The answer turned out to be: everything we'd just added in part 2 was being recomputed on every nonce iteration, even though it didn't depend on the nonce.

The PBaaS pipeline now looked like this per iteration:

for (uint64_t nonce = start; nonce < end; nonce++) {
    // build 196-byte preHeader with nonce embedded
    // run blake2b("VerusDefaultHash", preHeader, 32)
    // memcpy blake output to buf[124..156]
    // memcpy nonce to buf[108..140]
    // canonical clear: zero out [4..99], [104..107], [108..139], [151..214]
    // CVerusHashV2.Reset()
    // CVerusHashV2.Write(buf, 299) — runs the full VerusHash pipeline
    // compare hash to target, submit if good
}

The thing is, almost none of that depends on the nonce. The job is constant for its whole duration; the only per-iteration change is the 8-byte nonce counter inside the blake2b input. That does change the blake output, but the rest of the 196-byte preHeader stays the same.

And inside CVerusHashV2, there's another layer: the verusclhash key cache. The first call to .Write() with a given buffer seeds a key derived from the buffer's leading bytes. Subsequent calls with the same prefix reuse the cached key. We were destroying that cache every iteration by calling .Reset().
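The caching behaviour described above can be sketched with a toy class. This is an illustration of the idea only, not the CVerusHashV2 internals: `PrefixKeyCache`, its 32-byte prefix length in the usage below, and the FNV-1a key derivation are all hypothetical stand-ins.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical prefix-keyed cache. The key is re-derived only when the
// leading bytes of the buffer change, or after reset() invalidates it.
class PrefixKeyCache {
public:
    explicit PrefixKeyCache(size_t prefix_len) : prefix_len_(prefix_len) {}

    uint64_t key_for(const uint8_t* buf, size_t len) {
        const size_t n = len < prefix_len_ ? len : prefix_len_;
        const bool same_prefix =
            valid_ && cached_prefix_.size() == n &&
            (n == 0 || std::memcmp(cached_prefix_.data(), buf, n) == 0);
        if (!same_prefix) {
            cached_prefix_.assign(buf, buf + n);
            cached_key_ = derive(buf, n);  // the expensive step in a real hasher
            valid_ = true;
            ++derivations_;
        }
        return cached_key_;
    }

    // Throws the cached key away, forcing a re-derivation on the next call.
    void reset() { valid_ = false; }

    size_t derivations() const { return derivations_; }

private:
    // Stand-in derivation (FNV-1a, 64-bit); NOT the real clhash key schedule.
    static uint64_t derive(const uint8_t* p, size_t n) {
        uint64_t h = 14695981039346656037ull;
        for (size_t i = 0; i < n; ++i) {
            h ^= p[i];
            h *= 1099511628211ull;
        }
        return h;
    }

    size_t prefix_len_;
    std::vector<uint8_t> cached_prefix_;
    uint64_t cached_key_ = 0;
    bool valid_ = false;
    size_t derivations_ = 0;
};
```

With a cache like this, changing bytes outside the prefix leaves `derivations()` at 1, while calling `reset()` before every hash pushes it up by one per iteration, which is the pattern the paragraph above describes.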

The fix was to recognize what's per-job vs what's per-nonce:

Operation	Per job	Per nonce
Build the 196-byte preHeader skeleton	✓
Initialize blake2b state with personalization	✓
Pre-compute CVerusHashV2 key from the canonical buffer prefix	✓
Memcpy the nonce into the partial-fill region of the scratch buffer		✓ (8 bytes)
Run the VerusHash 2.2 pipeline (haraka512 + clhash + haraka512_keyed × 16 rounds)		✓ (unavoidable)
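
The split in the table can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the UnminerMac code: `JobContext`, `make_job`, `hash_naive`, `hash_hoisted`, and the counter offset of 108 are hypothetical, and `toy_hash` (FNV-1a) stands in for VerusHash so the example runs on its own.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for the real hash (FNV-1a, 64-bit). NOT VerusHash.
inline uint64_t toy_hash(const uint8_t* data, size_t len) {
    uint64_t h = 14695981039346656037ull;
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 1099511628211ull;
    }
    return h;
}

constexpr size_t kBufSize = 299;      // buffer length quoted in the article
constexpr size_t kNonceOffset = 108;  // hypothetical position of the 8-byte counter

// Naive shape: rebuild the whole buffer for every nonce.
inline uint64_t hash_naive(const std::vector<uint8_t>& job_bytes, uint64_t nonce) {
    std::array<uint8_t, kBufSize> buf{};
    std::memcpy(buf.data(), job_bytes.data(), std::min(job_bytes.size(), kBufSize));
    std::memcpy(buf.data() + kNonceOffset, &nonce, sizeof nonce);
    return toy_hash(buf.data(), buf.size());
}

// Hoisted shape: everything that depends only on the job is built once.
struct JobContext {
    std::array<uint8_t, kBufSize> scratch{};
};

inline JobContext make_job(const std::vector<uint8_t>& job_bytes) {
    JobContext ctx;
    std::memcpy(ctx.scratch.data(), job_bytes.data(),
                std::min(job_bytes.size(), kBufSize));
    return ctx;
}

// Per-nonce work: patch 8 bytes in place, then hash.
inline uint64_t hash_hoisted(JobContext& ctx, uint64_t nonce) {
    std::memcpy(ctx.scratch.data() + kNonceOffset, &nonce, sizeof nonce);
    return toy_hash(ctx.scratch.data(), ctx.scratch.size());
}
```

Both shapes produce the same hash for the same nonce; the hoisted one simply stops paying for the template build on every iteration.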

The inner loop went from this:

// Before: ~960ns per iteration on M5 P-core
for (...) {
    build_preheader(...);          // ~120ns
    blake2b_VerusDefault(...);     // ~280ns
    memcpy + canonical_clear(...); // ~200ns
    vh2.Reset();                   // ~40ns
    vh2.Write(scratch, 299);       // ~320ns (incl. key cache rebuild)
}

To this:

// After: ~250ns per iteration on M5 P-core
// Per-job setup runs once:
build_preheader_template(...);
preseed_clhash_key(...);

// Per-nonce hot loop:
for (uint64_t n = 0; n < 50000; n++) {
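    // ONE mutation: the 8-byte counter in the partial fill.
    // (Assumes body_tail_room <= sizeof(local_nonce); anything larger
    // would read past the counter.)
    if (body_tail_room > 0) {
        memcpy(scratch_tail, &local_nonce, body_tail_room);
        memcpy(submit_tail,  &local_nonce, body_tail_room);
    }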
    vh2.Reset();
    vh2.Write(scratch.data(), scratch.size());
    vh2.Finalize2b(hash);
    // compare + maybe submit
    local_nonce += n_threads;
}
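
The `local_nonce += n_threads` step in the loop above interleaves the nonce space across threads. Here is a small sketch of that striding. It assumes each thread starts at its own thread index, which the loop above does not show, and `nonces_for_thread` is a hypothetical helper written for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Nonces visited by one thread: start at the thread's index (an assumption),
// then step by the total thread count, as in the hot loop.
inline std::vector<uint64_t> nonces_for_thread(uint64_t thread_index,
                                               uint64_t n_threads,
                                               uint64_t iterations) {
    std::vector<uint64_t> out;
    out.reserve(iterations);
    uint64_t local_nonce = thread_index;
    for (uint64_t n = 0; n < iterations; ++n) {
        out.push_back(local_nonce);
        local_nonce += n_threads;
    }
    return out;
}
```

Under that assumption, 7 threads running 100 iterations each visit every nonce in 0..699 exactly once, with no coordination between threads.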

Result on the same M5, same pool, same job:

[STATS] 1.04 MH/s | 7 threads → [STATS] 3.94 MH/s | 7 threads

3.8× speedup. Per-thread throughput went from ~150 KH/s to ~563 KH/s. The bottleneck moved from "memory churn + crypto setup" to actually-running-the-hash, which is what should be the bottleneck.
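
The figures above are easy to cross-check. The helpers below are hypothetical, written only for this arithmetic. They show that the ~960 ns and ~250 ns in the loop comments line up with the aggregate rate across all 7 threads: 1000 / 1.04 ≈ 962 ns and 1000 / 3.94 ≈ 254 ns per hash. At the per-thread rates, a single thread spends roughly 6.7 µs and 1.8 µs of wall-clock time per hash.

```cpp
// 1 MH/s is 10^6 hashes per second, i.e. one hash every 1000 ns.
constexpr double ns_per_hash(double mhs) { return 1000.0 / mhs; }

// Per-thread rate in KH/s, given an aggregate rate in MH/s.
constexpr double per_thread_khs(double mhs, int threads) {
    return mhs * 1000.0 / threads;
}

// Ratio of the new rate to the old one.
constexpr double speedup(double before_mhs, double after_mhs) {
    return after_mhs / before_mhs;
}
```

For example, `per_thread_khs(1.04, 7)` is about 148.6 and `per_thread_khs(3.94, 7)` is about 562.9, matching the ~150 and ~563 KH/s quoted above, and `speedup(1.04, 3.94)` is about 3.79.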

The lesson is one of those things that sound obvious in retrospect: profile your hot loop before adding code to it. We'd added the PBaaS preprocessing in part 2 to fix a correctness bug, and the fix worked, but the work per iteration ballooned without anyone noticing because we were chasing accepted shares, not hashes per second.
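
A rough way to keep an eye on per-iteration cost is a small timing harness around the loop body. The sketch below uses only `std::chrono::steady_clock`; `avg_ns_per_iteration` is a hypothetical helper, not something UnminerMac ships, and a sampling profiler will give a far better breakdown than wall-clock averages.

```cpp
#include <chrono>
#include <cstdint>

// Runs body(i) for i in [0, iterations) and returns the mean wall-clock
// nanoseconds per call. Coarse by design: it includes loop overhead, and the
// compiler may optimize away a body that has no observable effect.
template <typename Fn>
double avg_ns_per_iteration(Fn&& body, uint64_t iterations) {
    const auto t0 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iterations; ++i) {
        body(i);
    }
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = t1 - t0;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Wrapping the per-nonce body in a lambda and logging this number per job would have made the jump in work per iteration visible as soon as the preprocessing went in.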