ned Productions Consulting


Technology musings by Niall Douglas

Friday 11th August 2017 4.18am

You might remember that last year I spent a lot of money on a MacBook Pro for one very specific reason: its NVMe flash SSD. Despite its supposed state-of-the-art performance, the various Windows benchmarking tools never showed it to be particularly fast. Sure, with ridiculous queue depths and thread counts thrown at it you could hit 3Gb/sec, but random 4Kb @ QD1 benchmarking always showed pretty poor results, maybe 2,500 IOPS (13Mb/sec), which is about a 400 microsecond average response time.

These last two nights I decided to put together a little benchmark using my low level i/o library AFIO v2, and it does rather better than the typical Windows benchmarking tools. This is for direct, write-through i/o (i.e. operations complete only once the data is physically on storage) on Windows 10 x64:

50% of random 4Kb reads @ QD1 complete within 0.66 microseconds.
95% of random 4Kb reads @ QD1 complete within 498 microseconds.
99.999% of random 4Kb reads @ QD1 complete within 2,818 microseconds.
Average random 4Kb reads @ QD1 complete within 83 microseconds (approx 12,000 IOPS or 47Mb/sec).

50% of random 4Kb writes @ QD1 complete within 0.66 microseconds.
95% of random 4Kb writes @ QD1 complete within 347 microseconds.
99.999% of random 4Kb writes @ QD1 complete within 25,146 microseconds.
Average random 4Kb writes @ QD1 complete within 54 microseconds (approx 18,500 IOPS or 72Mb/sec).

So at least fivefold faster than the typical Windows benchmarking tools! And this is with a single thread, no async and no complex code; it's actually a very simple routine, along the lines of the sketch below, though with a fair bit of overhead spent recording the timing of each individual operation, so a pure throughput benchmark would likely do even better.
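
For the curious, here is roughly what such a routine looks like. This is a minimal sketch, not the AFIO v2 code I actually ran: it uses raw Win32 calls with the same direct, write-through flags, and the filename, iteration count and output format are assumptions for illustration.

```cpp
// Minimal sketch of a QD1 random 4Kb read latency test using direct,
// write-through i/o on Windows. Illustration only: error handling is
// omitted and "testfile.dat" is a placeholder for a large pre-made file.
#include <windows.h>
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
  // Bypass the kernel page cache and require write-through to storage.
  HANDLE h = CreateFileW(L"testfile.dat", GENERIC_READ, FILE_SHARE_READ,
                         nullptr, OPEN_EXISTING,
                         FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, nullptr);
  if (h == INVALID_HANDLE_VALUE) return 1;

  LARGE_INTEGER filesize, freq;
  GetFileSizeEx(h, &filesize);
  QueryPerformanceFrequency(&freq);

  // FILE_FLAG_NO_BUFFERING needs sector aligned buffers, offsets and lengths;
  // VirtualAlloc returns page aligned memory which satisfies that.
  void *buffer = VirtualAlloc(nullptr, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

  const long long blocks = filesize.QuadPart / 4096;
  std::mt19937_64 rng(42);
  std::vector<double> latencies;
  latencies.reserve(100000);

  for (size_t n = 0; n < 100000; n++)
  {
    LARGE_INTEGER offset, begin, end;
    offset.QuadPart = (long long)(rng() % (unsigned long long)blocks) * 4096;
    SetFilePointerEx(h, offset, nullptr, FILE_BEGIN);

    // Time only the read itself.
    QueryPerformanceCounter(&begin);
    DWORD bytesread = 0;
    ReadFile(h, buffer, 4096, &bytesread, nullptr);
    QueryPerformanceCounter(&end);

    latencies.push_back(1000000.0 * (end.QuadPart - begin.QuadPart) / freq.QuadPart);
  }

  // Sort once and read off the percentiles quoted above.
  std::sort(latencies.begin(), latencies.end());
  printf("50%%     : %.2f us\n", latencies[latencies.size() / 2]);
  printf("95%%     : %.2f us\n", latencies[(size_t)(latencies.size() * 0.95)]);
  printf("99.999%% : %.2f us\n", latencies[(size_t)(latencies.size() * 0.99999)]);

  VirtualFree(buffer, 0, MEM_RELEASE);
  CloseHandle(h);
  return 0;
}
```

The two details that matter are that the buffer is sector aligned (FILE_FLAG_NO_BUFFERING insists on it) and that only the ReadFile call sits between the two timestamps, so each sample is the latency of a single i/o.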


The above was for non-cached, direct i/o. So how does i/o via the kernel page cache fare? This is for cached, write-back i/o:

50% of random 4Kb reads @ QD1 complete within 379 microseconds.
95% of random 4Kb reads @ QD1 complete within 534 microseconds.
99.999% of random 4Kb reads @ QD1 complete within 2,892 microseconds.
Average random 4Kb reads @ QD1 complete within 365 microseconds (approx 2,700 IOPS or 10Mb/sec).

50% of random 4Kb writes @ QD1 complete within 304 microseconds.
95% of random 4Kb writes @ QD1 complete within 611 microseconds.
99.999% of random 4Kb writes @ QD1 complete within 82,703 microseconds.
Average random 4Kb writes @ QD1 complete within 286 microseconds (approx 3,500 IOPS or 13Mb/sec).

That makes direct i/o to this Apple NVMe SSD a full 4.4x faster for reads and 5.3x faster for writes than i/o via the kernel page cache! That's very unusual; historically it was certainly never the case, but these NVMe SSDs are a whole new ball game. Unlike the kernel, which must memcpy() each 4Kb block between the page cache and your buffer, NVMe can DMA the entire 4Kb straight into your buffer in a single PCIe transaction costing about 0.5 microseconds, which this Apple SSD achieves at least 50% of the time, as you can see above. The CPU, on the other hand, can sometimes memcpy() 4Kb in just 4 microseconds if it is done entirely in L2 cache, but most of the time the kernel needs to allocate a cache page, fill it with i/o and then copy it out, which would appear to cost 350 microseconds or so.
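
If you want to check the arithmetic tying those averages to the IOPS, bandwidth and speedup figures quoted above, it is a one-liner per figure. A trivial sketch, using nothing but the average latencies from the two runs:

```cpp
// Derive IOPS, bandwidth and the direct vs cached speedups from the
// average 4Kb @ QD1 latencies reported above. Pure arithmetic, no i/o.
#include <cstdio>

int main()
{
  const double direct_read_us = 83, direct_write_us = 54;   // direct i/o averages
  const double cached_read_us = 365, cached_write_us = 286; // cached i/o averages

  auto report = [](const char *name, double us) {
    double iops = 1000000.0 / us;          // operations per second at QD1
    double mb_per_sec = iops * 4.0 / 1024; // 4Kb per operation
    printf("%s: %.0f IOPS, %.0f Mb/sec\n", name, iops, mb_per_sec);
  };
  report("direct reads ", direct_read_us);   // ~12,000 IOPS, ~47 Mb/sec
  report("direct writes", direct_write_us);  // ~18,500 IOPS, ~72 Mb/sec
  report("cached reads ", cached_read_us);   // ~2,700 IOPS, ~10 Mb/sec
  report("cached writes", cached_write_us);  // ~3,500 IOPS, ~13 Mb/sec

  printf("read speedup : %.1fx\n", cached_read_us / direct_read_us);   // 4.4x
  printf("write speedup: %.1fx\n", cached_write_us / direct_write_us); // 5.3x
  return 0;
}
```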

Ordinarily I advise anyone who asks to stick with cached i/o: most people cannot outperform the highly tuned kernel i/o caching algorithms, which work well most of the time. But with high end NVMe SSDs I think I'm going to have to change my advice: it's actually much quicker to do small working set i/o (e.g. 1Mb) straight from the SSD's own RAM cache than to go through the kernel page cache. You basically run at PCIe latency, which for 4Kb is about L2 cache latency and much faster than having the CPU copy it through main memory.
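
To be concrete about what going direct involves on Windows, here is a sketch of the two ways of opening the same file: the default cached, write-back route versus the direct, write-through route measured above. The filename is a placeholder and error handling is omitted; the point is that the only differences are the open flags and the sector alignment requirement on buffers, offsets and lengths.

```cpp
// Cached (write-back) versus direct (write-through) opens of the same file.
#include <windows.h>
#include <malloc.h>

void open_both_ways()
{
  // Default: all i/o goes via the kernel page cache and writes are flushed lazily.
  HANDLE cached = CreateFileW(L"data.bin", GENERIC_READ | GENERIC_WRITE,
                              FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL, nullptr);

  // Direct: the page cache is bypassed and a write only completes once the
  // device reports the data is on storage.
  HANDLE direct = CreateFileW(L"data.bin", GENERIC_READ | GENERIC_WRITE,
                              FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, nullptr);

  // The price of going direct: buffers, file offsets and transfer lengths
  // must all be sector aligned. 4096 byte alignment covers current drives.
  void *iobuffer = _aligned_malloc(4096, 4096);

  // ... issue i/o on whichever handle suits the working set ...

  _aligned_free(iobuffer);
  CloseHandle(direct);
  CloseHandle(cached);
}
```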

And that's like wow. Such a game changer. Especially as the Optane SSDs coming to replace flash SSDs deliver sub-10 microsecond latency @ 95% and sub-60 microsecond latency @ 99.999%.