ned Productions Consulting


Technology musings by Niall Douglas
(an expert advice and services company based in Ireland)


Wednesday 27th November 2013 3.52am

Link shared: https://github.com/BoostGSoC/boost.afio/commits/content_hashing_merge

Was very pleased today with my fourth attempt at an asynchronous batch hash engine for Boost.AFIO, which can schedule digest hashing of arbitrary-length texts across all SIMD streams on all CPU cores (with a preference for filling SIMD streams before CPU cores). Getting the scheduler design right was quite tricky: as individual hash operations end, the tail operations from the end of the scheduled batch need moving downwards to fill the holes, so that individual SIMD streams in CPUs aren't doing useless work. And this must be done while each CPU core runs entirely independently of the others (a CPU core only breaks processing if one of its streams runs out of data), so getting the locking semantics correct was non-trivial.
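The hole-filling idea can be sketched roughly as follows. This is a heavily simplified, single-threaded toy with hypothetical names (none of this is AFIO's actual scheduler, and all the locking is elided): each SIMD lane holds one in-flight job, and whenever a lane empties, the last pending job in the batch is moved down into it so no lane idles while work remains.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Toy job: just an id and how many hash blocks it has left.
struct Job { std::size_t id; std::size_t blocks_left; };

struct LaneScheduler {
  std::vector<std::optional<Job>> lanes;   // one slot per SIMD stream
  std::deque<Job> pending;                 // the scheduled batch (tail at back)

  explicit LaneScheduler(std::size_t nlanes) : lanes(nlanes) {}

  // Refill any empty lane from the *tail* of the batch, so holes are
  // always plugged by the operations scheduled last.
  void fill_holes() {
    for (auto &lane : lanes) {
      if (!lane && !pending.empty()) {
        lane = pending.back();
        pending.pop_back();
      }
    }
  }

  // Advance every busy lane by one block; finished jobs free their lane,
  // and the hole is filled immediately.
  void step() {
    for (auto &lane : lanes) {
      if (lane && --lane->blocks_left == 0)
        lane.reset();
    }
    fill_holes();
  }

  bool busy() const {
    if (!pending.empty()) return true;
    for (auto const &lane : lanes)
      if (lane) return true;
    return false;
  }
};
```

With four lanes and jobs of mixed lengths, the tail-stealing keeps all lanes full until the batch drains, which is the property the real multi-core version has to preserve under concurrent access.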

Benchmarks so far are good, but not stellar: for a single-stream SHA-256 I'm seeing 17.03 cycles/byte, where a reference SHA-256 implementation should do 14.89 cycles/byte on Intel Ivy Bridge, so a 14.3% overhead which is still too much. There is still too much complexity in the SHA round scheduling, and I'll need to get into the disassembly to figure out why (I suspect there is too much branching, and because I'm using a universal rather than a compile-time specialised 4-SHA and 1-SHA implementation, the loops in the code are too verbose and cache-consuming).
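To illustrate the universal-versus-specialised point: if the stream count is a template parameter, the compiler sees the inner loop's trip count at compile time and can fully unroll it, whereas a runtime stream count leaves a counter and branch in the hot round loop. The "round" below is a meaningless stand-in, not real SHA-256; it only exists to show the two shapes of code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Compile-time specialised: NStreams is known when the loop is compiled,
// so toy_rounds<1> and toy_rounds<4> each become tight, unrolled code.
template <unsigned NStreams>
void toy_rounds(std::array<std::uint32_t, 4> &state) {
  for (unsigned r = 0; r < 64; ++r)
    for (unsigned s = 0; s < NStreams; ++s)      // trip count fixed at compile time
      state[s] = state[s] * 2654435761u + r;     // stand-in for a SHA round step
}

// Universal: one function for any stream count, at the cost of carrying
// the runtime loop bound (and its branch) through every round.
void toy_rounds_universal(std::array<std::uint32_t, 4> &state, unsigned nstreams) {
  for (unsigned r = 0; r < 64; ++r)
    for (unsigned s = 0; s < nstreams; ++s)
      state[s] = state[s] * 2654435761u + r;
}
```

Both compute the same thing; the difference only shows up in the generated assembly, which is exactly where the disassembly hunt for that 14.3% would start.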

For sheer throughput though, this engine is pretty sweet: with all eight hyperthreaded cores going at full speed, I'm already getting 1.854 cycles/byte, which is 8.03x faster than a traditional SHA-256. You might wonder how that's possible with only eight threads - it's due to the 4-SHA256 SSE2 implementation, which gets each core to process four SHAs at once, so in fact the engine is chewing through a full 32 SHA digests simultaneously. Suddenly only an 8x improvement doesn't look so good, and I think 1.4 cycles/byte is doable (a 12x improvement), though I'd imagine one would start running into main memory bandwidth limits not long after that, as the Ivy Bridge memory prefetchers aren't much use with this hash engine: there are only four of them, which isn't enough to cope with 32 concurrent memory read streams.
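The trick behind a 4-SHA256 SSE2 kernel is the data layout: state is stored struct-of-arrays, so word i of four independent messages sits side by side in one 128-bit register, and a single SSE2 instruction advances all four lanes at once. Here is a portable sketch of that layout, with each 4-lane operation spelled as a plain loop; in the real kernel each such loop is one intrinsic (and the names below are mine, not AFIO's).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One 32-bit word from each of four simultaneous digests, i.e. the
// contents of one 128-bit SSE2 register in the 4-wide layout.
using Lanes4 = std::array<std::uint32_t, 4>;

// Lane-parallel add: the real kernel does this with _mm_add_epi32.
inline Lanes4 add4(Lanes4 a, Lanes4 b) {
  for (int i = 0; i < 4; ++i) a[i] += b[i];
  return a;
}

// Lane-parallel rotate-right, as used throughout the SHA-256 round:
// SSE2 has no rotate, so it's two shifts and an OR per register.
inline Lanes4 rotr4(Lanes4 a, unsigned n) {
  for (int i = 0; i < 4; ++i)
    a[i] = (a[i] >> n) | (a[i] << (32 - n));
  return a;
}
```

Eight threads each driving four such lanes is where the 32 simultaneous digests come from, and also where the 32 concurrent memory read streams that swamp Ivy Bridge's four prefetchers come from.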

Anyway, all that will come after Thanksgiving as we're off to Indiana tomorrow to visit with Megan's folks. One of the last times we'll see them without a baby in hand! Have a great Thanksgiving!