Thursday 10th April 2014 7.47pm

So, the parts for my replacement cloud server finally arrived. I had to order the Supermicro X10SL7-F motherboard from the US and import it to Ireland, as the only places I could find it reasonably priced in Europe were in Germany, and none of those companies would deduct VAT for my company. The X10SL7-F is quite literally the only motherboard in its niche - high end Haswell home servers - so I had little choice in the selection unfortunately (or in the cost, actually, though in fairness it's a quarter of the cost of other Xeon server solutions of a similar spec).

Anyway, I plugged it all together last night with 16GB of ECC RAM and an Intel Xeon E3-1230 v3 CPU, fired it up and yeah, what's not to like - Aspeed KVM over IP integrated so I never need to plug in a monitor or keyboard ever again (or physically hit the power or reset buttons :) ), two 10Gbit ethernet ports which saves me the expansion card I had before, and no less than fourteen SATA connectors, with eight of those hanging from an LSI SAS2308 controller which is Intel VT-d virtualisable, i.e. you can pass it through at full performance to a virtualised FreeNAS VM. And of course the thing is an absolute beast compared to the $35 Sandy Bridge Pentium G530 I had in there before, as well it should be given the price difference.

But then came an interesting problem: the cloud box runs Proxmox as the VM hypervisor, and that uses a fairly ancient 2.6.32 Linux kernel which of course doesn't understand Haswell CPUs, so as dmesg reports, it disables the intel_idle power management driver and falls back onto acpi_idle. Yet these are the ACPI power states the kernel uses:

ed@milla:~$ cat /proc/acpi/processor/CPU0/power
active state:            C0
max_cstate:              C8
maximum allowed latency: 2000000000 usec
states:
    C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00494647] duration[00000000000000000000]
    C2:                  type[C2] promotion[--] demotion[--] latency[148] usage[01379614] duration[00000000064459801669]
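
If you want to keep an eye on this over time rather than eyeballing it, the output above is easy enough to pick apart. Here is a minimal Python sketch that parses the per-state lines from a pasted copy of that output. The function name and the dictionary keys are my own invention for illustration, and the regular expression only assumes the line layout shown above, nothing more:

```python
import re

# Matches one C-state line in the layout shown in the output above.
# The promotion/demotion fields are skipped, whatever they contain.
STATE_RE = re.compile(
    r"^\s*(?P<name>C\d+):\s+type\[(?P<type>C\d+)\].*?"
    r"latency\[(?P<latency>\d+)\]\s+"
    r"usage\[(?P<usage>\d+)\]\s+"
    r"duration\[(?P<duration>\d+)\]"
)

def parse_acpi_power(text):
    """Return one dict per C-state line found in `text` (hypothetical helper)."""
    states = []
    for line in text.splitlines():
        m = STATE_RE.match(line)
        if m:
            states.append({
                "name": m.group("name"),
                "type": m.group("type"),
                "latency": int(m.group("latency")),
                "usage": int(m.group("usage")),
                "duration": int(m.group("duration")),
            })
    return states

# A copy of the output from the post, used as test input.
SAMPLE = """\
active state:            C0
max_cstate:              C8
maximum allowed latency: 2000000000 usec
states:
    C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00494647] duration[00000000000000000000]
    C2:                  type[C2] promotion[--] demotion[--] latency[148] usage[01379614] duration[00000000064459801669]
"""

if __name__ == "__main__":
    for state in parse_acpi_power(SAMPLE):
        print(state["name"], state["latency"], state["usage"])
```

The point it makes is the same as the raw output: max_cstate says C8, but the kernel only knows about two states, C1 and C2.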

[Yes, Haswells add a C8 low power state, but you will need a suitable PSU capable of going from hundreds of watts to fractions of watts in microseconds, and mine is too old. Mine is new enough to support C7 though, and that draws only a watt, and rather handily Supermicro let you choose in the BIOS whether C7 is available rather than just disabling it outright for improved compatibility with older PSUs]

Anyway, the curious thing above is that Linux's default ondemand CPU governor doesn't bother changing the clock speed of the CPU on Haswell, because the C-states reported by ACPI above all say to run at the full 3.3GHz. My big question was whether this means I have misconfigured the BIOS or something. Much trial and error later, I came to the conclusion that no, I haven't. So will running this Haswell server with a kernel not capable of power saving result in a big electricity bill or not?

So, here are the empirical results for whole system power draw (which includes four 3TB WD Red hard drives @ 4w each):

Idle: 49.4w
1x 'cat /dev/urandom > /dev/null &': 66.4w
2x 'cat /dev/urandom > /dev/null &': 75.4w
3x 'cat /dev/urandom > /dev/null &': 82.9w
4x 'cat /dev/urandom > /dev/null &': 87.5w
8x 'cat /dev/urandom > /dev/null &': 91.0w

Well that certainly looks about right … even with that SAS2308 controller adding 10w and 16w for the hard drives, that leaves 23.4w for the rest of the system, and the internet says it should draw no less than 21w. It looks pretty close to optimal, and that makes no sense!
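
For anyone wanting to check my sums, here is the arithmetic as a small Python sketch. The wattages are the readings listed above, and the 10w controller and 4w per drive figures are the ones quoted in this post. The names are made up for illustration:

```python
# Whole-system power readings from the post, in watts, keyed by the number
# of concurrent 'cat /dev/urandom > /dev/null' processes (0 = idle).
READINGS_W = {0: 49.4, 1: 66.4, 2: 75.4, 3: 82.9, 4: 87.5, 8: 91.0}

SAS_CONTROLLER_W = 10.0  # figure quoted in the post for the SAS2308
DRIVES_W = 4 * 4.0       # four WD Red drives at roughly 4w each

def idle_remainder(readings=READINGS_W):
    """Idle power left for board, CPU, RAM and PSU losses."""
    return round(readings[0] - SAS_CONTROLLER_W - DRIVES_W, 1)

def marginal_watts(readings=READINGS_W):
    """Extra watts per added process between consecutive readings."""
    counts = sorted(readings)
    steps = []
    for lo, hi in zip(counts, counts[1:]):
        per_process = (readings[hi] - readings[lo]) / (hi - lo)
        steps.append((lo, hi, round(per_process, 3)))
    return steps

if __name__ == "__main__":
    print("idle remainder:", idle_remainder(), "w")
    for lo, hi, watts in marginal_watts():
        print(f"{lo} -> {hi} processes: +{watts} w per process")
```

The first process costs far more than any later one, and going from four to eight processes adds under a watt each, which fits with the extra four only filling up the Hyper-Threading siblings of cores that are already awake.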

Anyway, after a lot of head scratching I finally figured out the answer - it turns out that Haswell, unlike Ivy Bridge, does its own ondemand governor internally and doesn't need the OS to do it for it, so while the OS thinks it is running the CPU full belt, the CPU is actually doing its own power management. Witness the following thanks to the i7z tool (http://code.google.com/p/i7z/, you'll need to grab it from trunk for Haswell support):

Socket [0] - [physical cores=4, logical cores=8, max online cores ever=4]
TURBO ENABLED on 4 Cores, Hyper Threading ON
Max Frequency without considering Turbo 3390.73 MHz (99.73 x [34])
Max TURBO Multiplier (if Enabled) with 1/2/3/4 Cores is 37x/37x/36x/35x
Real Current Frequency 3176.12 MHz 99.73 x 31.85
        Core [core-id] :Actual Freq (Mult.)      C0%   Halt(C1)% C3 %   C6 %   C7 % Temp      VCore
        Core 1 [0]:       3176.12 (31.85x)         1    2.49       0       0    97.3    25      0.9999
        Core 2 [1]:       3063.24 (30.72x)         1    15.8       0       0    84.1    26      0.9974
        Core 3 [2]:       2825.56 (28.33x)         1    3.26       0       0    96.7    29      0.9974
        Core 4 [3]:       2645.90 (26.53x)         1    4.36       0       0    95.6    26      0.9974

C0 = Processor running without halting
C1 = Processor running with halts (States >C0 are power saver modes with cores idling)
C3 = Cores running with PLL turned off and core cache turned off
C6, C7 = Everything in C3 + core state saved to last level cache, C7 is deeper than C6
Above values in table are in percentage over the last 1 sec
[core-id] refers to core-id number in /proc/cpuinfo
'Garbage Values' message printed when garbage values are read
Ctrl+C to exit

Note the reduced voltage (VCore) and that the clocks are all much lower. Let's try one thread of work:

Socket [0] - [physical cores=4, logical cores=8, max online cores ever=4]
TURBO ENABLED on 4 Cores, Hyper Threading ON
Max Frequency without considering Turbo 3390.73 MHz (99.73 x [34])
Max TURBO Multiplier (if Enabled) with 1/2/3/4 Cores is 37x/37x/36x/35x
Real Current Frequency 3648.67 MHz 99.73 x 36.59
        Core [core-id] :Actual Freq (Mult.)      C0%   Halt(C1)% C3 %   C6 %   C7 % Temp      VCore
        Core 1 [0]:       3643.07 (36.53x)         1    0.313      0       0    99.4    34      1.0438
        Core 2 [1]:       3609.90 (36.20x)         1     1.1       0       0    98.8    34      1.0461
        Core 3 [2]:       3648.67 (36.59x)         1     100       0       0       0    46      1.0414
        Core 4 [3]:       3607.95 (36.18x)         1    6.01       0       0      94    39      1.0413

Hmm, voltage and clock speed have jumped into turbo land. Now it claims that one core is always C1 halted, which is surely not true, whilst everything else remains in C7. Let's try four threads:

Socket [0] - [physical cores=4, logical cores=8, max online cores ever=4]
TURBO ENABLED on 4 Cores, Hyper Threading ON
Max Frequency without considering Turbo 3390.73 MHz (99.73 x [34])
Max TURBO Multiplier (if Enabled) with 1/2/3/4 Cores is 37x/37x/36x/35x
Real Current Frequency 3490.45 MHz 99.73 x 35.00
        Core [core-id] :Actual Freq (Mult.)      C0%   Halt(C1)% C3 %   C6 %   C7 % Temp      VCore
        Core 1 [0]:       3490.45 (35.00x)         1    99.5       0       0       0    51      1.0194
        Core 2 [1]:       3490.45 (35.00x)       100       0       0       0       0    49      1.0219
        Core 3 [2]:       3489.02 (34.99x)         1    99.6       0       0       0    54      1.0219
        Core 4 [3]:       3488.69 (34.98x)         1    99.6       0       0       0    48      1.0219

Suddenly the C1 halted state when one core is 100% in use makes sense - it's the Hyper-Threading: a single thread at 100% isn't actually using the whole core from the CPU's perspective, so the core is in fact halting. If you run eight threads, you get all four cores at C0, as is to be expected.
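
If you would rather compare the i7z tables above by number than by eye, the per-core lines are regular enough to parse. This is only a rough Python sketch: it assumes the column order shown in the header line above (C0, C1, C3, C6, C7, Temp, VCore), and the helper names are hypothetical rather than anything i7z itself provides:

```python
import re

# "Core 1 [0]:  3176.12 (31.85x)  1  2.49  0  0  97.3  25  0.9999"
# The header line "Core [core-id] :..." has no digit after "Core", so it
# does not match.
CORE_RE = re.compile(
    r"^\s*Core\s+(\d+)\s+\[(\d+)\]:\s+([\d.]+)\s+\(([\d.]+)x\)\s+(.*)$"
)
FIELDS = ("c0", "c1", "c3", "c6", "c7", "temp", "vcore")

def parse_i7z_cores(text):
    """Return one dict per 'Core N [id]:' line in pasted i7z output."""
    cores = []
    for line in text.splitlines():
        m = CORE_RE.match(line)
        if not m:
            continue
        values = [float(v) for v in m.group(5).split()]
        if len(values) != len(FIELDS):
            continue  # not the column layout this sketch assumes
        core = {"core": int(m.group(1)), "mhz": float(m.group(3)),
                "mult": float(m.group(4))}
        core.update(zip(FIELDS, values))
        cores.append(core)
    return cores

def summarise(cores):
    """Average clock, C7 residency and VCore across the parsed cores."""
    n = len(cores)
    return {
        "mean_mhz": sum(c["mhz"] for c in cores) / n,
        "mean_c7": sum(c["c7"] for c in cores) / n,
        "mean_vcore": sum(c["vcore"] for c in cores) / n,
    }

# The idle and four-thread tables from the post.
IDLE = """\
Core 1 [0]:       3176.12 (31.85x)         1    2.49       0       0    97.3    25      0.9999
Core 2 [1]:       3063.24 (30.72x)         1    15.8       0       0    84.1    26      0.9974
Core 3 [2]:       2825.56 (28.33x)         1    3.26       0       0    96.7    29      0.9974
Core 4 [3]:       2645.90 (26.53x)         1    4.36       0       0    95.6    26      0.9974
"""
FOUR_THREADS = """\
Core 1 [0]:       3490.45 (35.00x)         1    99.5       0       0       0    51      1.0194
Core 2 [1]:       3490.45 (35.00x)       100       0       0       0       0    49      1.0219
Core 3 [2]:       3489.02 (34.99x)         1    99.6       0       0       0    54      1.0219
Core 4 [3]:       3488.69 (34.98x)         1    99.6       0       0       0    48      1.0219
"""

if __name__ == "__main__":
    for label, table in (("idle", IDLE), ("four threads", FOUR_THREADS)):
        print(label, summarise(parse_i7z_cores(table)))
```

At idle the cores average over 90% of their time in C7, and with four threads running C7 residency drops to zero, all without the OS governor doing anything.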

So what should we learn from this? Well, I did quite a few experiments with manually forcing the clock speed down to 800MHz, or using usermode CPU governors, or forcing kernel CPU governors - none made a jot of difference to idle power consumption except to raise it, plus they did stop the CPU from utilising turbo frequencies.
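
If you want to see what the OS side believes while running experiments like these, the cpufreq files in sysfs can be read directly. Below is a hedged Python sketch, assuming your kernel exposes the usual cpufreq files under /sys/devices/system/cpu (not every kernel or driver combination does). The function name and the `root` parameter are my own, the latter only there so it can be pointed at a test directory:

```python
from pathlib import Path

def read_cpufreq(root="/sys/devices/system/cpu"):
    """Map each cpuN to (governor, current frequency in kHz or None).

    Reads only; changing governors means writing to scaling_governor as
    root, which this sketch deliberately does not do.
    """
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(root).glob(pattern)):
        cpu = gov_file.parent.parent.name          # e.g. "cpu0"
        freq_file = gov_file.with_name("scaling_cur_freq")
        khz = int(freq_file.read_text().strip()) if freq_file.exists() else None
        result[cpu] = (gov_file.read_text().strip(), khz)
    return result

if __name__ == "__main__":
    for cpu, (governor, khz) in read_cpufreq().items():
        print(cpu, governor, khz)
```

Bear in mind that, as the i7z tables show, the frequency the OS reports here need not be the frequency the core is really running at.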

I therefore believe that, with Haswell, doing no OS power management at all is in fact ideal for this CPU, at least until my Linux kernel gains the Intel P-state driver, and even then I understand that driver is more about reducing power consumption under load than at idle.

So there you go eh? Who would have thought we've come full circle after ten years of active CPU power management!