Why are some Intel family 6 CPU models (Core 2, Pentium M) not supported by intel_idle?
I've been tuning my Linux kernel for Intel Core 2 Quad (Yorkfield) processors, and I noticed the following messages from dmesg:
[ 0.019526] cpuidle: using governor menu
[ 0.531691] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.550918] intel_idle: does not run on family 6 model 23
[ 0.554415] tsc: Marking TSC unstable due to TSC halts in idle
PowerTOP shows only states C1, C2 and C3 being used for the package and individual cores:
Package | CPU 0
POLL 0.0% | POLL 0.0% 0.1 ms
C1 0.0% | C1 0.0% 0.0 ms
C2 8.2% | C2 9.9% 0.4 ms
C3 84.9% | C3 82.5% 0.9 ms
| CPU 1
| POLL 0.1% 1.6 ms
| C1 0.0% 1.5 ms
| C2 9.6% 0.4 ms
| C3 82.7% 1.0 ms
| CPU 2
| POLL 0.0% 0.1 ms
| C1 0.0% 0.0 ms
| C2 7.2% 0.3 ms
| C3 86.5% 1.0 ms
| CPU 3
| POLL 0.0% 0.1 ms
| C1 0.0% 0.0 ms
| C2 5.9% 0.3 ms
| C3 87.7% 1.0 ms
Curious, I queried sysfs and found that the legacy acpi_idle driver was in use (I expected to see the intel_idle driver):
cat /sys/devices/system/cpu/cpuidle/current_driver
acpi_idle
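The same check can be scripted. The following is a minimal sketch in C, assuming only that the attribute is a one-line text file as the `cat` output above suggests; the helper names `read_sysfs_line` and `cpuidle_driver_is` are hypothetical and not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a sysfs-style text attribute into buf and strip the
 * trailing newline. Returns 0 on success, -1 if the file cannot be read.
 */
static int read_sysfs_line(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL && len > 0);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Returns 1 if the named driver is active, 0 if not, -1 on read failure. */
static int cpuidle_driver_is(const char *path, const char *expected)
{
	char name[64];

	if (read_sysfs_line(path, name, sizeof(name)) != 0)
		return -1;
	return strcmp(name, expected) == 0;
}
```

Calling `cpuidle_driver_is("/sys/devices/system/cpu/cpuidle/current_driver", "intel_idle")` on the system described here would be expected to return 0, since acpi_idle is the active driver.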
Looking at the kernel source code, the current intel_idle driver contains a debug message specifically noting that some Intel family 6 models are not supported by the driver:
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL && boot_cpu_data.x86 == 6)
    pr_debug("does not run on family %d model %d\n", boot_cpu_data.x86, boot_cpu_data.x86_model);
An earlier fork (November 22, 2010) of intel_idle.c shows anticipated support for Core 2 processors (model 23 actually covers both Core 2 Duo and Quad):
#ifdef FUTURE_USE
case 0x17: /* 23 - Core 2 Duo */
lapic_timer_reliable_states = (1 << 2) | (1 << 1); /* C2, C1 */
#endif
The above code was deleted in a December 2010 commit.
Unfortunately, there is almost no documentation in the source code, so there is no explanation regarding the lack of support for the idle function in these CPUs.
My current kernel configuration is as follows:
CONFIG_SMP=y
CONFIG_MCORE2=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
CONFIG_INTEL_IDLE=y
My questions are as follows:
- Is there a specific hardware reason that Core 2 processors are not supported by intel_idle?
- Is there a more appropriate way to configure a kernel for optimal CPU idle support for this family of processors (aside from disabling support for intel_idle)?
linux drivers cpu power-management intel
edited Jul 14 at 15:28
asked Jul 12 at 13:12
vallismortis
3 Answers
accepted
While researching Core 2 CPU power states ("C-states"), I actually managed to implement support for most of the legacy Intel Core/Core 2 processors. The complete implementation (Linux patch) with all of the background information is documented here.
As I accumulated more information about these processors, it started to become apparent that the C-states supported in the Core 2 model(s) are far more complex than those in both earlier and later processors. These are known as Enhanced C-states (or "CxE"), which involve the package, individual cores and other components on the chipset (e.g., memory). At the time the intel_idle driver was released, the code was not particularly mature and several Core 2 processors had been released that had conflicting C-state support.
Some compelling information on Core 2 Solo/Duo C-state support was found in this article from 2006. This is in relation to support on Windows, however it does indicate the robust hardware C-state support on these processors. The information regarding Kentsfield conflicts with the actual model number, so I believe they are actually referring to a Yorkfield below:
...the quad-core Intel Core 2 Extreme (Kentsfield) processor supports
all five performance and power saving technologies - Enhanced Intel
SpeedStep (EIST), Thermal Monitor 1 (TM1) and Thermal Monitor 2 (TM2),
old On-Demand Clock Modulation (ODCM), as well as Enhanced C States
(CxE). Compared to Intel Pentium 4 and Pentium D 600, 800, and 900
processors, which are characterized only by Enhanced Halt (C1) State,
this function has been expanded in Intel Core 2 processors (as well as
Intel Core Solo/Duo processors) for all possible idle states of a
processor, including Stop Grant (C2), Deep Sleep (C3), and Deeper
Sleep (C4).
This article from 2008 outlines support for per-core C-states on multi-core Intel processors, including Core 2 Duo and Core 2 Quad (additional helpful background reading was found in this white paper from Dell):
A core C-state is a hardware C-state. There are several core idle
states, e.g. CC1 and CC3. As we know, a modern state of the art
processor has multiple cores, such as the recently released Core Duo
T5000/T7000 mobile processors, known as Penryn in some circles. What
we used to think of as a CPU / processor, actually has multiple
general purpose CPUs in side of it. The Intel Core Duo has 2 cores in
the processor chip. The Intel Core-2 Quad has 4 such cores per
processor chip. Each of these cores has its own idle state. This makes
sense as one core might be idle while another is hard at work on a
thread. So a core C-state is the idle state of one of those cores.
I found a 2010 presentation from Intel that provides some additional background about the intel_idle driver, but unfortunately does not explain the lack of support for Core 2:
This EXPERIMENTAL driver supersedes acpi_idle on Intel Atom
Processors, Intel Core i3/i5/i7 Processors and associated Intel Xeon
processors. It does not support the Intel Core2 processor or earlier.
The above presentation does indicate that the intel_idle driver is an implementation of the "menu" CPU governor, which has an impact on Linux kernel configuration (i.e., CONFIG_CPU_IDLE_GOV_LADDER vs. CONFIG_CPU_IDLE_GOV_MENU). The differences between the ladder and menu governors are succinctly described in this answer.
Dell has a helpful article that lists C-state C0 to C6 compatibility:
Modes C1 to C3 work by basically cutting clock signals used inside the
CPU, while modes C4 to C6 work by reducing the CPU voltage. "Enhanced"
modes can do both at the same time.
Mode    | Name                  | CPUs
C0      | Operating State       | All CPUs
C1      | Halt                  | 486DX4 and above
C1E     | Enhanced Halt         | All socket LGA775 CPUs
C1E     | -                     | Turion 64, 65-nm Athlon X2 and Phenom CPUs
C2      | Stop Grant            | 486DX4 and above
C2      | Stop Clock            | Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III
C2E     | Extended Stop Grant   | Core 2 Duo and above (Intel only)
C3      | Sleep                 | Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000
C3      | Deep Sleep            | Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64
C3      | AltVID                | AMD Turion 64
C4      | Deeper Sleep          | Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64
C4E/C5  | Enhanced Deeper Sleep | Core Solo, Core Duo and 45-nm mobile Core 2 Duo only
C6      | Deep Power Down       | 45-nm mobile Core 2 Duo only
From this table (which I later found to be incorrect in some cases), it appears that C-state support varied considerably across the Core 2 processors. Note that nearly all Core 2 processors are Socket LGA775; the exceptions are the Core 2 Solo SU3500 (Socket BGA956) and the Merom/Penryn processors. "Intel Core" Solo/Duo processors use either Socket PBGA479 or PPGA478.
An additional exception to the table was found in this article:
Intel's Core 2 Duo E8500 supports C-states C2 and C4, while the Core 2
Extreme QX9650 does not.
Interestingly, the QX9650 is a Yorkfield processor (Intel family 6, model 23, stepping 6). For reference, my Q9550S is Intel family 6, model 23 (0x17), stepping 10, which supposedly supports C-state C4 (confirmed through experimentation). Additionally, the Core 2 Solo SU3500 has an identical CPUID (family, model, stepping) to the Q9550S but is available in a non-LGA775 socket, which confounds interpretation of the above table.
Clearly, the CPUID must be used at least down to the stepping in order to identify C-state support for this model of processor, and in some cases that may be insufficient (undetermined at this time).
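To make the family/model/stepping distinction concrete, here is a minimal sketch in C of how those three fields are unpacked from the EAX value of CPUID leaf 1, following Intel's documented field layout. The struct and function names are hypothetical, and the raw signatures in the usage note (0x1067A and 0x10676) are simply the values implied by the family, model and stepping numbers quoted above, not readings taken from the hardware.

```c
#include <assert.h>
#include <stdint.h>

struct cpu_sig {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/* Decode the EAX value returned by CPUID leaf 1 into display values. */
static struct cpu_sig decode_cpuid_signature(uint32_t eax)
{
	struct cpu_sig sig;
	unsigned int base_family = (eax >> 8) & 0xF;
	unsigned int base_model = (eax >> 4) & 0xF;
	unsigned int ext_family = (eax >> 20) & 0xFF;
	unsigned int ext_model = (eax >> 16) & 0xF;

	sig.stepping = eax & 0xF;

	/* The extended family is only added when the base family is 0xF. */
	sig.family = base_family;
	if (base_family == 0xF)
		sig.family += ext_family;

	/* The extended model supplies the high nibble for families 6 and 0xF. */
	sig.model = base_model;
	if (base_family == 0x6 || base_family == 0xF)
		sig.model += ext_model << 4;

	return sig;
}
```

With this layout, a signature of 0x1067A decodes to family 6, model 23, stepping 10 (the Q9550S figures), while 0x10676 decodes to family 6, model 23, stepping 6 (the QX9650 figures). The two parts differ only in the stepping nibble, which a model-only match cannot see.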
The macro used to assign CPU idle information to a model is:
#define ICPU(model, cpu) \
    { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&cpu }
Where model is enumerated in asm/intel-family.h. Examining this header file, I see that Intel CPUs are assigned 8-bit identifiers that appear to match the Intel family 6 model numbers:
#define INTEL_FAM6_CORE2_PENRYN 0x17
From the above, we have Intel Family 6, Model 23 (0x17) defined as INTEL_FAM6_CORE2_PENRYN. This should be sufficient for defining idle states for most of the Model 23 processors, but could potentially cause issues with QX9650 as noted above.
So, minimally, each group of processors that has a distinct C-state set would need to be defined in this list.
Zagacki and Ponnala, Intel Technology Journal 12(3):219-227, 2008 indicate that Yorkfield processors do indeed support C2 and C4. They also seem to indicate that the ACPI 3.0a specification supports transitions only between C-states C0, C1, C2 and C3, which I presume may also limit the Linux acpi_idle driver to transitions between that limited set of C-states. However, this article indicates that may not always be the case:
Bear in mind that is the ACPI C state, not the processor one, so ACPI
C3 might be HW C6, etc.
Also of note:
Beyond the processor itself, since C4 is a synchronized effort between
major silicon components in the platform, the Intel Q45 Express
Chipset achieves a 28-percent power improvement.
The chipset I'm using is indeed an Intel Q45 Express Chipset.
The Intel documentation on MWAIT states is terse but confirms the BIOS-specific ACPI behavior:
The processor-specific C-states defined in MWAIT extensions can map to
ACPI defined C-state types (C0, C1, C2, C3). The mapping relationship
depends on the definition of a C-state by processor implementation and
is exposed to OSPM by the BIOS using the ACPI defined _CST table.
My interpretation of the above table (combined with a table from Wikipedia, asm/intel-family.h and the above articles) is:
Model 9 0x09 (Pentium M and Celeron M):
- Banias: C0, C1, C2, C3, C4
Model 13 0x0D (Pentium M and Celeron M):
- Dothan, Stealey: C0, C1, C2, C3, C4
Model 14 0x0E INTEL_FAM6_CORE_YONAH (Enhanced Pentium M, Enhanced Celeron M or Intel Core):
- Yonah (Core Solo, Core Duo): C0, C1, C2, C3, C4, C4E/C5
Model 15 0x0F INTEL_FAM6_CORE2_MEROM (some Core 2 and Pentium Dual-Core):
- Kentsfield, Merom, Conroe, Allendale (E2xxx/E4xxx and Core 2 Duo E6xxx, T7xxxx/T8xxxx, Core 2 Extreme QX6xxx, Core 2 Quad Q6xxx): C0, C1, C1E, C2, C2E
Model 23 0x17 INTEL_FAM6_CORE2_PENRYN (Core 2):
- Merom-L/Penryn-L: ?
- Penryn (Core 2 Duo 45-nm mobile): C0, C1, C1E, C2, C2E, C3, C4, C4E/C5, C6
- Yorkfield (Core 2 Extreme QX9650): C0, C1, C1E, C2E?, C3
- Wolfdale/Yorkfield (Core 2 Quad, C2Q Xeon, Core 2 Duo E5xxx/E7xxx/E8xxx, Pentium Dual-Core E6xxx, Celeron Dual-Core): C0, C1, C1E, C2, C2E, C3, C4
From the amount of diversity in C-state support within just the Core 2 line of processors, it appears that a lack of consistent support for C-states may have been the reason for not attempting to fully support them via the intel_idle driver. I would like to fully complete the above list for the entire Core 2 line.
This is not really a satisfying answer, because it makes me wonder how much unnecessary power is used and excess heat has been (and still is) generated by not fully utilizing the robust power-saving MWAIT C-states on these processors.
Chattopadhyay et al. 2018, Energy Efficient High Performance Processors: Recent Approaches for Designing Green High Performance Computing is worth noting for the specific behavior I'm looking for in the Q45 Express Chipset:
Package C-state (PC0-PC10) - When the compute domains, Core and
Graphics (GPU) are idle, the processor has an opportunity for
additional power savings at uncore and platform levels, for example,
flushing the LLC and power-gating the memory controller and DRAM IO,
and at some state, the whole processor can be turned off while its
state is preserved on always-on power domain.
As a test, I inserted the following at linux/drivers/idle/intel_idle.c line 127:
static struct cpuidle_state conroe_cstates[] = {
    {
        .name = "C1",
        .desc = "MWAIT 0x00",
        .flags = MWAIT2flg(0x00),
        .exit_latency = 3,
        .target_residency = 6,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
    {
        .name = "C1E",
        .desc = "MWAIT 0x01",
        .flags = MWAIT2flg(0x01),
        .exit_latency = 10,
        .target_residency = 20,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
//  {
//      .name = "C2",
//      .desc = "MWAIT 0x10",
//      .flags = MWAIT2flg(0x10),
//      .exit_latency = 20,
//      .target_residency = 40,
//      .enter = &intel_idle,
//      .enter_s2idle = intel_idle_s2idle, },
    {
        .name = "C2E",
        .desc = "MWAIT 0x11",
        .flags = MWAIT2flg(0x11),
        .exit_latency = 40,
        .target_residency = 100,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
    {
        .enter = NULL }
};
static struct cpuidle_state core2_cstates[] = {
    {
        .name = "C1",
        .desc = "MWAIT 0x00",
        .flags = MWAIT2flg(0x00),
        .exit_latency = 3,
        .target_residency = 6,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
    {
        .name = "C1E",
        .desc = "MWAIT 0x01",
        .flags = MWAIT2flg(0x01),
        .exit_latency = 10,
        .target_residency = 20,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
    {
        .name = "C2",
        .desc = "MWAIT 0x10",
        .flags = MWAIT2flg(0x10),
        .exit_latency = 20,
        .target_residency = 40,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
    {
        .name = "C2E",
        .desc = "MWAIT 0x11",
        .flags = MWAIT2flg(0x11),
        .exit_latency = 40,
        .target_residency = 100,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
    {
        .name = "C3",
        .desc = "MWAIT 0x20",
        .flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
        .exit_latency = 100,
        .target_residency = 400,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
    {
        .name = "C4",
        .desc = "MWAIT 0x30",
        .flags = MWAIT2flg(0x30) | CPUIDLE_FLAG_TLB_FLUSHED,
        .exit_latency = 200,
        .target_residency = 800,
        .enter = &intel_idle,
        .enter_s2idle = intel_idle_s2idle, },
    {
        .enter = NULL }
};
at intel_idle.c line 983:
static const struct idle_cpu idle_cpu_conroe = {
    .state_table = conroe_cstates,
    .disable_promotion_to_c1e = false,
};
static const struct idle_cpu idle_cpu_core2 = {
    .state_table = core2_cstates,
    .disable_promotion_to_c1e = false,
};
at intel_idle.c line 1073:
ICPU(INTEL_FAM6_CORE2_MEROM, idle_cpu_conroe),
ICPU(INTEL_FAM6_CORE2_PENRYN, idle_cpu_core2),
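For readers unfamiliar with the `.flags` entries in the state tables, the following standalone C sketch illustrates the packing idea: the MWAIT hint is stored in the top byte of the flags word, and the idle routine later extracts it again. The `SKETCH_` macros are illustrative stand-ins written for this example and are not copied from the kernel; in particular, the numeric value used for the TLB-flush flag is an assumption chosen only so the example compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins; the real definitions live in the kernel sources. */
#define SKETCH_MWAIT2FLG(hint)   ((uint32_t)((hint) & 0xFFu) << 24)
#define SKETCH_FLG2MWAIT(flags)  (((flags) >> 24) & 0xFFu)
#define SKETCH_FLAG_TLB_FLUSHED  0x10000u  /* assumed value, for illustration only */

/* Build a flags word for a state entered with the given MWAIT hint. */
static uint32_t make_state_flags(uint8_t hint, int tlb_flushed)
{
	uint32_t flags = SKETCH_MWAIT2FLG(hint);

	if (tlb_flushed)
		flags |= SKETCH_FLAG_TLB_FLUSHED;
	return flags;
}

/* Recover the MWAIT hint that the idle routine would pass in EAX. */
static uint8_t state_hint(uint32_t flags)
{
	return (uint8_t)SKETCH_FLG2MWAIT(flags);
}
```

Under these assumptions, `make_state_flags(0x30, 1)` yields 0x30010000 and `state_hint()` recovers 0x30 from it, so the low flag bits and the hint do not interfere with each other.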
After a quick compile and reboot of my PXE nodes, dmesg now shows:
[ 0.019845] cpuidle: using governor menu
[ 0.515785] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.543404] intel_idle: MWAIT substates: 0x22220
[ 0.543405] intel_idle: v0.4.1 model 0x17
[ 0.543413] tsc: Marking TSC unstable due to TSC halts in idle states deeper than C2
[ 0.543680] intel_idle: lapic_timer_reliable_states 0x2
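The "MWAIT substates: 0x22220" line can be decoded by hand. The sketch below, in C, assumes the value is the EDX output of CPUID leaf 5, where each 4-bit field counts the sub-states of one C-state group (bits 3:0 for C0, bits 7:4 for C1, and so on), and that the upper nibble of an MWAIT hint selects the group while the lower nibble selects the sub-state. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of sub-states reported for the C-state group selected by an MWAIT
 * hint. 'substates' is CPUID leaf 5 EDX; each 4-bit field counts sub-states,
 * with bits 3:0 for C0, 7:4 for C1, and so on.
 */
static unsigned int mwait_num_substates(uint32_t substates, uint8_t hint)
{
	unsigned int group = (hint >> 4) & 0xF;   /* hint 0x3_ selects field 4 */
	unsigned int shift = (group + 1) * 4;

	if (shift >= 32)
		return 0;   /* no field for this hint in a 32-bit register */
	return (substates >> shift) & 0xF;
}

/* A hint is usable only if its sub-state index is below the reported count. */
static int mwait_hint_supported(uint32_t substates, uint8_t hint)
{
	return (unsigned int)(hint & 0xF) < mwait_num_substates(substates, hint);
}
```

Read this way, 0x22220 reports two sub-states for each of the hint groups 0x0_, 0x1_, 0x2_ and 0x3_, so hints 0x30 and 0x31 would both be accepted by this processor while 0x32 or 0x40 would not.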
And now PowerTOP is showing:
Package | CPU 0
POLL 2.5% | POLL 0.0% 0.0 ms
C1E 2.9% | C1E 5.0% 22.4 ms
C2 0.4% | C2 0.2% 0.2 ms
C3 2.1% | C3 1.9% 0.5 ms
C4E 89.9% | C4E 92.6% 66.5 ms
| CPU 1
| POLL 10.0% 400.8 ms
| C1E 5.1% 6.4 ms
| C2 0.3% 0.1 ms
| C3 1.4% 0.6 ms
| C4E 76.8% 73.6 ms
| CPU 2
| POLL 0.0% 0.2 ms
| C1E 1.1% 3.7 ms
| C2 0.2% 0.2 ms
| C3 3.9% 1.3 ms
| C4E 93.1% 26.4 ms
| CPU 3
| POLL 0.0% 0.7 ms
| C1E 0.3% 0.3 ms
| C2 1.1% 0.4 ms
| C3 1.1% 0.5 ms
| C4E 97.0% 45.2 ms
I've finally accessed the Enhanced Core 2 C-states, and now that my cluster is rebooted the fan noise has dropped to almost nothing in this room. And it looks like there is a measurable drop in power consumption - my meter on 8 nodes appears to be averaging at least 5% lower (with one node still running the old kernel), but I'll try swapping the kernels out again as a test.
An interesting note regarding C4E support - My Yorkfield Q9550S processor appears to support it (or some other sub-state of C4), as evidenced above! This confuses me, because the Intel datasheet on the Core 2 Q9000 processor (section 6.2) only mentions C-states Normal (C0), HALT (C1 = 0x00), Extended HALT (C1E = 0x01), Stop Grant (C2 = 0x10), Extended Stop Grant (C2E = 0x11), Sleep/Deep Sleep (C3 = 0x20) and Deeper Sleep (C4 = 0x30). What is this additional 0x31 state? If I enable state C2, then C4E is used instead of C4. If I disable state C2 (force state C2E) then C4 is used instead of C4E. I suspect this may have something to do with the MWAIT flags, but I haven't yet found documentation for this behavior.
I'm not certain what to make of this: The C1E state appears to be used in lieu of C1, C2 is used in lieu of C2E and C4E is used in lieu of C4. I'm uncertain if C1/C1E, C2/C2E and C4/C4E can be used together with intel_idle or if they are redundant. I found a note in this 2010 presentation by Intel Labs Pittsburgh that indicates the transitions are C0 - C1 - C0 - C1E - C0, and further states:
C1E is only used when all the cores are in C1E
I believe that is to be interpreted as the C1E state is entered on other components (e.g. memory) only when all cores are in the C1E state. I also take this to apply equivalently to the C2/C2E and C4/C4E states (Although C4E is referred to as "C4E/C5" so I'm uncertain if C4E is a sub-state of C4 or if C5 is a sub-state of C4E. Testing seems to indicate C4/C4E is correct). I can force C2E to be used by commenting out the C2 state - however, this causes the C4 state to be used instead of C4E (more work may be required here). Hopefully there aren't any model 15 or model 23 processors that lack state C2E, because those processors would be limited to C1/C1E with the above code.
Also, the flags, latency and residency values could probably stand to be fine-tuned, but just taking educated guesses based on the Nehalem idle values seems to work fine. More reading will be required to make any improvements.
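To show why these two numbers matter, here is a deliberately simplified sketch in C of how an idle governor could choose among states: it takes the deepest state whose target residency fits the predicted idle time and whose exit latency fits the current latency limit. This is not the kernel's menu governor, which applies further heuristics; the struct, the function name `pick_idle_state` and its parameters are all hypothetical. The table values are the ones from the core2_cstates definition earlier in this answer.

```c
#include <assert.h>
#include <stddef.h>

struct idle_state {
	const char *name;
	unsigned int exit_latency_us;
	unsigned int target_residency_us;
};

/* Values copied from the core2_cstates table in this answer. */
static const struct idle_state core2_states[] = {
	{ "C1",    3,   6 },
	{ "C1E",  10,  20 },
	{ "C2",   20,  40 },
	{ "C2E",  40, 100 },
	{ "C3",  100, 400 },
	{ "C4",  200, 800 },
};

/*
 * Return the index of the deepest acceptable state, or -1 if none fits
 * (a real governor would fall back to polling in that case). The table
 * must be ordered from shallowest to deepest state.
 */
static int pick_idle_state(const struct idle_state *states, size_t count,
			   unsigned int predicted_idle_us,
			   unsigned int latency_limit_us)
{
	int best = -1;
	size_t i;

	assert(states != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (states[i].target_residency_us > predicted_idle_us)
			continue;   /* not idle long enough to pay off */
		if (states[i].exit_latency_us > latency_limit_us)
			continue;   /* waking up would take too long */
		best = (int)i;
	}
	return best;
}
```

With this table, a predicted idle period of 50 us selects C2, and a 50 us latency limit caps the choice at C2E no matter how long the idle period is. An exit latency that is set too low would therefore let a deep state be chosen when the system cannot actually tolerate its wake-up time.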
I tested this on a Core 2 Duo E2220 (Allendale), a Dual Core Pentium E5300 (Wolfdale), Core 2 Duo E7400, Core 2 Duo E8400 (Wolfdale), Core 2 Quad Q9550S (Yorkfield) and Core 2 Extreme QX9650, and I have found no issues beyond the aforementioned preference for state C2/C2E and C4/C4E.
Not covered by this driver modification:
- The original Core Solo/Core Duo (Yonah, non Core 2) are family 6, model 14. This is good because they supported the C4E/C5 (Enhanced Deeper Sleep) C-states but not the C1E/C2E states and would need their own idle definition.
The only issues that I can think of are:
- Core 2 Solo SU3300/SU3500 (Penryn-L) are family 6, model 23 and will be detected by this driver. However, they are not Socket LGA775 so they may not support the C1E Enhanced Halt C-state. Likewise for the Core 2 Solo ULV U2100/U2200 (Merom-L). However, the intel_idle driver appears to choose the appropriate C1/C1E based on hardware support of the sub-states.
- Core 2 Extreme QX9650 (Yorkfield) reportedly does not support C-state C2 or C4. I have confirmed this by purchasing a used Optiplex 780 and QX9650 Extreme processor on eBay. The processor supports C-states C1 and C1E. With this driver modification, the CPU idles in state C1E instead of C1, so there is presumably some power savings. I expected to see C-state C3, but it is not present when using this driver so I may need to look into this further.
I managed to find a slide from a 2009 Intel presentation on the transitions between C-states (i.e., Deep Power Down):
In conclusion, it turns out that there was no real reason for the lack of Core 2 support in the intel_idle driver. It is clear now that the original stub code for "Core 2 Duo" only handled C-states C1 and C2, which would have been far less efficient than the acpi_idle function which also handles C-state C3. Once I knew where to look, implementing support was easy. The helpful comments and other answers were much appreciated, and if Amazon is listening, you know where to send the check.
This update has been committed to github. I will e-mail a patch to the LKML soon.
Update: I also managed to dig up a Socket T/LGA775 Allendale (Conroe) Core 2 Duo E2220, which is family 6, model 15, so I added support for that as well. This model lacks support for C-state C4, but supports C1/C1E and C2/C2E. This should also work for other Conroe-based chips (E4xxx/E6xxx) and possibly all Kentsfield and Merom (non Merom-L) processors.
Update: I finally found some MWAIT tuning resources. This Power vs. Performance writeup and this Deeper C states and increased latency blog post both contain some useful information on identifying CPU idle latencies. Unfortunately, this only reports those exit latencies that were coded into the kernel (but, interestingly, only those hardware states supported by the processor):
# cd /sys/devices/system/cpu/cpu0/cpuidle
# for state in `ls -d state*` ; do echo c-$state `cat $state/name` `cat $state/latency` ; done
c-state0/ POLL 0
c-state1/ C1 3
c-state2/ C1E 10
c-state3/ C2 20
c-state4/ C2E 40
c-state5/ C3 20
c-state6/ C4 60
c-state7/ C4E 100
That is nice detective work! I had forgotten how complex the C2D/C2Q C-states were. Re untapped power savings, if your firmware is good enough then you should still be getting the benefit of at least some of the C-states via acpi_idle and the various performance governors. What states does powertop show on your system?
-- Stephen Kitt
Jul 12 at 21:07
Very nice information, have you considered proposing your patch to the upstream Linux kernel?
-- Lekensteyn
Jul 13 at 9:10
"The C1E state appears to be used in lieu of C1..." Which state is used - as shown by powertop - is determined solely by the kernel, therefore I believe it will not "have something to do with the MWAIT flags", it will be chosen solely based on the order of the states and the exit_latency and target_residency. That said, I would be slightly concerned about leaving states in the table if they didn't seem to get used when tested... in case those states didn't actually work as expected, and there was some other workload pattern that led to them being used & the unexpected behaviour happening.
-- sourcejedi
Jul 15 at 14:13
"the transitions are C0 - C1 - C0 - C1E - C0" - I don't think that's a good description of that slide. From the kernel/powertop point of view, all transitions are either from C0 or to C0. If you're not in C0, you're not running any instructions, therefore the kernel cannot either observe or request any transition between states on that cpu :-). And as you say, the kernel "menu" governor may well e.g. jump straight into C1E, without spending any time in C1 first.
-- sourcejedi
Jul 15 at 14:29
"just taking educated guesses based on the Nehalem idle values seems to work fine" - note this is not a good way to get your patch accepted upstream :-P, in that the exit latency must not be an underestimate, otherwise I think you will violate PM_QOS_CPU_DMA_LATENCY, which may be set by drivers (or userspace?)
-- sourcejedi
Jul 15 at 14:41
Is there a more appropriate way to configure a kernel for optimal CPU idle support for this family of processors (aside from disabling support for intel_idle)
You have ACPI enabled, and you've checked that acpi_idle is in use. I sincerely doubt you have missed any helpful kernel config option. You can always check powertop for possible suggestions, but probably you already knew that.
This is not an answer, but I want to format it :-(.
Looking at the kernel source code, the current intel_idle driver contains a test to specifically exclude Intel family 6 from the driver.
No it doesn't :-).
id = x86_match_cpu(intel_idle_ids);
if (!id) {
    if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
        boot_cpu_data.x86 == 6)
        pr_debug(PREFIX "does not run on family %d model %d\n",
                 boot_cpu_data.x86, boot_cpu_data.x86_model);
    return -ENODEV;
}
The if statement does not exclude Family 6. Instead, the if statement provides a message, when debugging is enabled, that this specific modern Intel CPU is not supported by intel_idle. In fact, my current i5-5300U CPU is Family 6 and it uses intel_idle.
What excludes your CPU is that there is no match in the intel_idle_ids table.
I noticed this commit which implemented the table. The code it removes had a switch statement instead. This makes it easy to see that the earliest model intel_idle has been implemented/successfully tested/whatever is 0x1A = 26. https://github.com/torvalds/linux/commit/b66b8b9a4a79087dde1b358a016e5c8739ccf186
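The table-driven match can be pictured with a small standalone sketch in C. It is not the kernel's x86_match_cpu(); the struct, the table contents and the function name `match_model` are hypothetical, and the table is an illustrative three-entry subset rather than the driver's real list.

```c
#include <assert.h>
#include <stddef.h>

struct model_entry {
	unsigned int family;
	unsigned int model;
	const char *label;
};

/* Illustrative subset only; the real table lists many more models. */
static const struct model_entry supported_models[] = {
	{ 6, 0x1A, "Nehalem-EP" },
	{ 6, 0x2A, "Sandy Bridge" },
	{ 6, 0x3A, "Ivy Bridge" },
};

/* Return the matching table entry, or NULL when the CPU is not listed. */
static const struct model_entry *match_model(unsigned int family,
					     unsigned int model)
{
	size_t count = sizeof(supported_models) / sizeof(supported_models[0]);
	size_t i;

	for (i = 0; i < count; i++) {
		if (supported_models[i].family == family &&
		    supported_models[i].model == model)
			return &supported_models[i];
	}
	return NULL;   /* caller treats this as "driver does not apply" */
}
```

In this sketch `match_model(6, 0x17)` returns NULL simply because model 0x17 has no row. That is the same reason the real driver declines to load on a Yorkfield: the model is absent from the list, not explicitly rejected.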
Thank you for noting the specific test is based on the intel_idle_ids table - I've adjusted the phrasing of the question, which still stands regarding Core 2/Yorkfield support.
– vallismortis
Jul 12 at 15:06
This article provides additional background and usage information for the PowerTOP command.
– vallismortis
Jul 12 at 20:55
up vote
6
down vote
I suspect this could just be a case of opportunity and cost. When intel_idle was added, it seems Core 2 Duo support was planned, but it never was fully implemented; perhaps by the time the Intel engineers got round to it, it wasn't worth it any more. The equation is relatively complex: intel_idle needs to provide sufficient benefits over acpi_idle to make it worth supporting here, on CPUs which will see the "improved" kernel in sufficient numbers...
As sourcejedi's answer says, the driver doesn't exclude all of family 6. The intel_idle initialisation checks for CPUs in a list of CPU models, covering basically all micro-architectures from Nehalem to Kaby Lake. Yorkfield is older than that (and significantly different: Nehalem is very different from the architectures which came before it). The family 6 test only affects whether the error message is printed; its effect is only that the error message will only be displayed on Intel CPUs, not AMD CPUs (Intel family 6 includes all non-NetBurst Intel CPUs since the Pentium Pro).
To answer your configuration question, you could completely disable intel_idle, but leaving it in is fine too (as long as you don't mind the warning).
pr_debug() message should only appear if you do something very specific to enable that debug message, so you don't even have to ignore the warning
– sourcejedi
Jul 12 at 14:01
2
@sourcejedi I mentioned that because the OP is seeing it.
– Stephen Kitt
Jul 12 at 14:02
gotcha. I present a half-serious comment: since we are asked about a sensible kernel config, if it is used day-to-day, maybe don't use the option that enables all debug messages? With the right option, they can be enabled dynamically and selectively when necessary. kernel.org/doc/html/v4.17/admin-guide/dynamic-debug-howto.html If you enable all debug messages, you probably have lots of messages you are ignoring anyway :).
– sourcejedi
Jul 12 at 14:07
@sourcejedi I fail to see the relevance of your comments regarding disabling kernel messages. I don't see this as being constructive to the question, which specifically addresses Core 2 support for the intel_idle driver.
– vallismortis
Jul 12 at 15:14
@vallismortis it is very tangential. It means that there is valid configuration you can use for Core 2 and above, which does not print this as an annoying warning message which must simply be ignored, and will use intel_idle if supported... but then I suppose you would use dynamically loaded modules anyway, so maybe not worth mentioning.
– sourcejedi
Jul 12 at 15:19
up vote
22
down vote
accepted
While researching Core 2 CPU power states ("C-states"), I actually managed to implement support for most of the legacy Intel Core/Core 2 processors. The complete implementation (Linux patch) with all of the background information is documented here.
As I accumulated more information about these processors, it became apparent that the C-states supported in the Core 2 model(s) are far more complex than those in both earlier and later processors. These are known as Enhanced C-states (or "CxE"), which involve the package, individual cores and other components on the chipset (e.g., memory). At the time the intel_idle driver was released, the code was not particularly mature and several Core 2 processors had been released that had conflicting C-state support.
Some compelling information on Core 2 Solo/Duo C-state support was found in this article from 2006. It relates to support on Windows, but it does indicate the robust hardware C-state support on these processors. The information regarding Kentsfield conflicts with the actual model number, so I believe they are actually referring to a Yorkfield below:
...the quad-core Intel Core 2 Extreme (Kentsfield) processor supports
all five performance and power saving technologies - Enhanced Intel
SpeedStep (EIST), Thermal Monitor 1 (TM1) and Thermal Monitor 2 (TM2),
old On-Demand Clock Modulation (ODCM), as well as Enhanced C States
(CxE). Compared to Intel Pentium 4 and Pentium D 600, 800, and 900
processors, which are characterized only by Enhanced Halt (C1) State,
this function has been expanded in Intel Core 2 processors (as well as
Intel Core Solo/Duo processors) for all possible idle states of a
processor, including Stop Grant (C2), Deep Sleep (C3), and Deeper
Sleep (C4).
This article from 2008 outlines support for per-core C-states on multi-core Intel processors, including Core 2 Duo and Core 2 Quad (additional helpful background reading was found in this white paper from Dell):
A core C-state is a hardware C-state. There are several core idle
states, e.g. CC1 and CC3. As we know, a modern state of the art
processor has multiple cores, such as the recently released Core Duo
T5000/T7000 mobile processors, known as Penryn in some circles. What
we used to think of as a CPU / processor, actually has multiple
general purpose CPUs in side of it. The Intel Core Duo has 2 cores in
the processor chip. The Intel Core-2 Quad has 4 such cores per
processor chip. Each of these cores has its own idle state. This makes
sense as one core might be idle while another is hard at work on a
thread. So a core C-state is the idle state of one of those cores.
I found a 2010 presentation from Intel that provides some additional background about the intel_idle driver, but unfortunately does not explain the lack of support for Core 2:
This EXPERIMENTAL driver supersedes acpi_idle on Intel Atom
Processors, Intel Core i3/i5/i7 Processors and associated Intel Xeon
processors. It does not support the Intel Core2 processor or earlier.
The above presentation does indicate that the intel_idle driver is an implementation of the "menu" CPU governor, which has an impact on Linux kernel configuration (i.e., CONFIG_CPU_IDLE_GOV_LADDER vs. CONFIG_CPU_IDLE_GOV_MENU). The differences between the ladder and menu governors are succinctly described in this answer.
Dell has a helpful article that lists C-state C0 to C6 compatibility:
Modes C1 to C3 work by basically cutting clock signals used inside the
CPU, while modes C4 to C6 work by reducing the CPU voltage. "Enhanced"
modes can do both at the same time.
Mode   | Name                  | CPUs
C0     | Operating State       | All CPUs
C1     | Halt                  | 486DX4 and above
C1E    | Enhanced Halt         | All socket LGA775 CPUs
C1E    | -                     | Turion 64, 65-nm Athlon X2 and Phenom CPUs
C2     | Stop Grant            | 486DX4 and above
C2     | Stop Clock            | Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III
C2E    | Extended Stop Grant   | Core 2 Duo and above (Intel only)
C3     | Sleep                 | Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000
C3     | Deep Sleep            | Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64
C3     | AltVID                | AMD Turion 64
C4     | Deeper Sleep          | Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64
C4E/C5 | Enhanced Deeper Sleep | Core Solo, Core Duo and 45-nm mobile Core 2 Duo only
C6     | Deep Power Down       | 45-nm mobile Core 2 Duo only
From this table (which I later found to be incorrect in some cases), it appears that there were a variety of differences in C-state support among the Core 2 processors. Note that nearly all Core 2 processors are Socket LGA775; the exceptions are the Core 2 Solo SU3500, which is Socket BGA956, and the mobile Merom/Penryn processors. "Intel Core" Solo/Duo processors are either Socket PBGA479 or PPGA478.
An additional exception to the table was found in this article:
Intel's Core 2 Duo E8500 supports C-states C2 and C4, while the Core 2
Extreme QX9650 does not.
Interestingly, the QX9650 is a Yorkfield processor (Intel family 6, model 23, stepping 6). For reference, my Q9550S is Intel family 6, model 23 (0x17), stepping 10, which supposedly supports C-state C4 (confirmed through experimentation). Additionally, the Core 2 Solo SU3500 has an identical CPUID (family, model, stepping) to the Q9550S but is available in a non-LGA775 socket, which confounds interpretation of the above table.
Clearly, the CPUID must be used at least down to the stepping in order to identify C-state support for this model of processor, and in some cases that may be insufficient (undetermined at this time).
The macro used to assign CPU idle information is:
#define ICPU(model, cpu) \
	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&cpu }
Where model is enumerated in asm/intel-family.h. Examining this header file, I see that Intel CPUs are assigned 8-bit identifiers that appear to match the Intel family 6 model numbers:
#define INTEL_FAM6_CORE2_PENRYN 0x17
From the above, we have Intel Family 6, Model 23 (0x17) defined as INTEL_FAM6_CORE2_PENRYN. This should be sufficient for defining idle states for most of the Model 23 processors, but could potentially cause issues with QX9650 as noted above.
So, minimally, each group of processors that has a distinct C-state set would need to be defined in this list.
Zagacki and Ponnala, Intel Technology Journal 12(3):219-227, 2008 indicate that Yorkfield processors do indeed support C2 and C4. They also seem to indicate that the ACPI 3.0a specification supports transitions only between C-states C0, C1, C2 and C3, which I presume may also limit the Linux acpi_idle driver to transitions between that limited set of C-states. However, this article indicates that may not always be the case:
Bear in mind that is the ACPI C state, not the processor one, so ACPI
C3 might be HW C6, etc.
Also of note:
Beyond the processor itself, since C4 is a synchronized effort between
major silicon components in the platform, the Intel Q45 Express
Chipset achieves a 28-percent power improvement.
The chipset I'm using is indeed an Intel Q45 Express Chipset.
The Intel documentation on MWAIT states is terse but confirms the BIOS-specific ACPI behavior:
The processor-specific C-states defined in MWAIT extensions can map to
ACPI defined C-state types (C0, C1, C2, C3). The mapping relationship
depends on the definition of a C-state by processor implementation and
is exposed to OSPM by the BIOS using the ACPI defined _CST table.
My interpretation of the above table (combined with a table from Wikipedia, asm/intel-family.h and the above articles) is:
Model 9 0x09 (Pentium M and Celeron M):
- Banias: C0, C1, C2, C3, C4
Model 13 0x0D (Pentium M and Celeron M):
- Dothan, Stealey: C0, C1, C2, C3, C4
Model 14 0x0E INTEL_FAM6_CORE_YONAH (Enhanced Pentium M, Enhanced Celeron M or Intel Core):
- Yonah (Core Solo, Core Duo): C0, C1, C2, C3, C4, C4E/C5
Model 15 0x0F INTEL_FAM6_CORE2_MEROM (some Core 2 and Pentium Dual-Core):
- Kentsfield, Merom, Conroe, Allendale (E2xxx/E4xxx and Core 2 Duo E6xxx, T7xxxx/T8xxxx, Core 2 Extreme QX6xxx, Core 2 Quad Q6xxx): C0, C1, C1E, C2, C2E
Model 23 0x17 INTEL_FAM6_CORE2_PENRYN (Core 2):
- Merom-L/Penryn-L: ?
- Penryn (Core 2 Duo 45-nm mobile): C0, C1, C1E, C2, C2E, C3, C4, C4E/C5, C6
- Yorkfield (Core 2 Extreme QX9650): C0, C1, C1E, C2E?, C3
- Wolfdale/Yorkfield (Core 2 Quad, C2Q Xeon, Core 2 Duo E5xxx/E7xxx/E8xxx, Pentium Dual-Core E6xxx, Celeron Dual-Core): C0, C1, C1E, C2, C2E, C3, C4
From the amount of diversity in C-state support within just the Core 2 line of processors, it appears that a lack of consistent support for C-states may have been the reason for not attempting to fully support them via the intel_idle driver. I would like to fully complete the above list for the entire Core 2 line.
This is not really a satisfying answer, because it makes me wonder how much unnecessary power is used and excess heat has been (and still is) generated by not fully utilizing the robust power-saving MWAIT C-states on these processors.
Chattopadhyay et al. 2018, Energy Efficient High Performance Processors: Recent Approaches for Designing Green High Performance Computing is worth noting for the specific behavior I'm looking for in the Q45 Express Chipset:
Package C-state (PC0-PC10) - When the compute domains, Core and
Graphics (GPU) are idle, the processor has an opportunity for
additional power savings at uncore and platform levels, for example,
flushing the LLC and power-gating the memory controller and DRAM IO,
and at some state, the whole processor can be turned off while its
state is preserved on always-on power domain.
As a test, I inserted the following at linux/drivers/idle/intel_idle.c line 127:
/* Idle states for family 6, model 15 (Conroe/Merom); latencies are educated guesses. */
static struct cpuidle_state conroe_cstates[] = {
	{
		.name = "C1",
		.desc = "MWAIT 0x00",
		.flags = MWAIT2flg(0x00),
		.exit_latency = 3,
		.target_residency = 6,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C1E",
		.desc = "MWAIT 0x01",
		.flags = MWAIT2flg(0x01),
		.exit_latency = 10,
		.target_residency = 20,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
//	{
//		.name = "C2",
//		.desc = "MWAIT 0x10",
//		.flags = MWAIT2flg(0x10),
//		.exit_latency = 20,
//		.target_residency = 40,
//		.enter = &intel_idle,
//		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C2E",
		.desc = "MWAIT 0x11",
		.flags = MWAIT2flg(0x11),
		.exit_latency = 40,
		.target_residency = 100,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.enter = NULL }	/* table terminator */
};
/* Idle states for family 6, model 23 (Penryn/Wolfdale/Yorkfield); latencies are educated guesses. */
static struct cpuidle_state core2_cstates[] = {
	{
		.name = "C1",
		.desc = "MWAIT 0x00",
		.flags = MWAIT2flg(0x00),
		.exit_latency = 3,
		.target_residency = 6,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C1E",
		.desc = "MWAIT 0x01",
		.flags = MWAIT2flg(0x01),
		.exit_latency = 10,
		.target_residency = 20,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C2",
		.desc = "MWAIT 0x10",
		.flags = MWAIT2flg(0x10),
		.exit_latency = 20,
		.target_residency = 40,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C2E",
		.desc = "MWAIT 0x11",
		.flags = MWAIT2flg(0x11),
		.exit_latency = 40,
		.target_residency = 100,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C3",
		.desc = "MWAIT 0x20",
		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
		.exit_latency = 100,
		.target_residency = 400,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C4",
		.desc = "MWAIT 0x30",
		.flags = MWAIT2flg(0x30) | CPUIDLE_FLAG_TLB_FLUSHED,
		.exit_latency = 200,
		.target_residency = 800,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.enter = NULL }	/* table terminator */
};
at intel_idle.c line 983:
static const struct idle_cpu idle_cpu_conroe = {
	.state_table = conroe_cstates,
	.disable_promotion_to_c1e = false,
};
static const struct idle_cpu idle_cpu_core2 = {
	.state_table = core2_cstates,
	.disable_promotion_to_c1e = false,
};
at intel_idle.c line 1073:
	ICPU(INTEL_FAM6_CORE2_MEROM, idle_cpu_conroe),
	ICPU(INTEL_FAM6_CORE2_PENRYN, idle_cpu_core2),
After a quick compile and reboot of my PXE nodes, dmesg now shows:
[ 0.019845] cpuidle: using governor menu
[ 0.515785] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.543404] intel_idle: MWAIT substates: 0x22220
[ 0.543405] intel_idle: v0.4.1 model 0x17
[ 0.543413] tsc: Marking TSC unstable due to TSC halts in idle states deeper than C2
[ 0.543680] intel_idle: lapic_timer_reliable_states 0x2
And now PowerTOP is showing:
Package | CPU 0
POLL 2.5% | POLL 0.0% 0.0 ms
C1E 2.9% | C1E 5.0% 22.4 ms
C2 0.4% | C2 0.2% 0.2 ms
C3 2.1% | C3 1.9% 0.5 ms
C4E 89.9% | C4E 92.6% 66.5 ms
| CPU 1
| POLL 10.0% 400.8 ms
| C1E 5.1% 6.4 ms
| C2 0.3% 0.1 ms
| C3 1.4% 0.6 ms
| C4E 76.8% 73.6 ms
| CPU 2
| POLL 0.0% 0.2 ms
| C1E 1.1% 3.7 ms
| C2 0.2% 0.2 ms
| C3 3.9% 1.3 ms
| C4E 93.1% 26.4 ms
| CPU 3
| POLL 0.0% 0.7 ms
| C1E 0.3% 0.3 ms
| C2 1.1% 0.4 ms
| C3 1.1% 0.5 ms
| C4E 97.0% 45.2 ms
I've finally accessed the Enhanced Core 2 C-states, and now that my cluster is rebooted the fan noise has dropped to almost nothing in this room. And it looks like there is a measurable drop in power consumption - my meter on 8 nodes appears to be averaging at least 5% lower (with one node still running the old kernel), but I'll try swapping the kernels out again as a test.
An interesting note regarding C4E support - My Yorkfield Q9550S processor appears to support it (or some other sub-state of C4), as evidenced above! This confuses me, because the Intel datasheet on the Core 2 Q9000 processor (section 6.2) only mentions C-states Normal (C0), HALT (C1 = 0x00), Extended HALT (C1E = 0x01), Stop Grant (C2 = 0x10), Extended Stop Grant (C2E = 0x11), Sleep/Deep Sleep (C3 = 0x20) and Deeper Sleep (C4 = 0x30). What is this additional 0x31 state? If I enable state C2, then C4E is used instead of C4. If I disable state C2 (force state C2E) then C4 is used instead of C4E. I suspect this may have something to do with the MWAIT flags, but I haven't yet found documentation for this behavior.
I'm not certain what to make of this: The C1E state appears to be used in lieu of C1, C2 is used in lieu of C2E and C4E is used in lieu of C4. I'm uncertain if C1/C1E, C2/C2E and C4/C4E can be used together with intel_idle or if they are redundant. I found a note in this 2010 presentation by Intel Labs Pittsburgh that indicates the transitions are C0 - C1 - C0 - C1E - C0, and further states:
C1E is only used when all the cores are in C1E
I believe that is to be interpreted as the C1E state is entered on other components (e.g. memory) only when all cores are in the C1E state. I also take this to apply equivalently to the C2/C2E and C4/C4E states (Although C4E is referred to as "C4E/C5" so I'm uncertain if C4E is a sub-state of C4 or if C5 is a sub-state of C4E. Testing seems to indicate C4/C4E is correct). I can force C2E to be used by commenting out the C2 state - however, this causes the C4 state to be used instead of C4E (more work may be required here). Hopefully there aren't any model 15 or model 23 processors that lack state C2E, because those processors would be limited to C1/C1E with the above code.
Also, the flags, latency and residency values could probably stand to be fine-tuned, but just taking educated guesses based on the Nehalem idle values seems to work fine. More reading will be required to make any improvements.
I tested this on a Core 2 Duo E2220 (Allendale), a Dual Core Pentium E5300 (Wolfdale), Core 2 Duo E7400, Core 2 Duo E8400 (Wolfdale), Core 2 Quad Q9550S (Yorkfield) and Core 2 Extreme QX9650, and I have found no issues beyond the afore-mentioned preference for state C2/C2E and C4/C4E.
Not covered by this driver modification:
- The original Core Solo/Core Duo (Yonah, non Core 2) are family 6, model 14. This is good because they supported the C4E/C5 (Enhanced Deeper Sleep) C-states but not the C1E/C2E states and would need their own idle definition.
The only issues that I can think of are:
- Core 2 Solo SU3300/SU3500 (Penryn-L) are family 6, model 23 and will be detected by this driver. However, they are not Socket LGA775 so they may not support the C1E Enhanced Halt C-state. Likewise for the Core 2 Solo ULV U2100/U2200 (Merom-L). However, the intel_idle driver appears to choose the appropriate C1/C1E based on hardware support of the sub-states.
- Core 2 Extreme QX9650 (Yorkfield) reportedly does not support C-state C2 or C4. I have confirmed this by purchasing a used Optiplex 780 and QX9650 Extreme processor on eBay. The processor supports C-states C1 and C1E. With this driver modification, the CPU idles in state C1E instead of C1, so there is presumably some power savings. I expected to see C-state C3, but it is not present when using this driver so I may need to look into this further.
I managed to find a slide from a 2009 Intel presentation on the transitions between C-states (i.e., Deep Power Down):
In conclusion, it turns out that there was no real reason for the lack of Core 2 support in the intel_idle driver. It is clear now that the original stub code for "Core 2 Duo" only handled C-states C1 and C2, which would have been far less efficient than the acpi_idle function which also handles C-state C3. Once I knew where to look, implementing support was easy. The helpful comments and other answers were much appreciated, and if Amazon is listening, you know where to send the check.
This update has been committed to github. I will e-mail a patch to the LKML soon.
Update: I also managed to dig up a Socket T/LGA775 Allendale (Conroe) Core 2 Duo E2220, which is family 6, model 15, so I added support for that as well. This model lacks support for C-state C4, but supports C1/C1E and C2/C2E. This should also work for other Conroe-based chips (E4xxx/E6xxx) and possibly all Kentsfield and Merom (non Merom-L) processors.
Update: I finally found some MWAIT tuning resources. This Power vs. Performance writeup and this Deeper C states and increased latency blog post both contain some useful information on identifying CPU idle latencies. Unfortunately, this only reports those exit latencies that were coded into the kernel (but, interestingly, only those hardware states supported by the processor):
# cd /sys/devices/system/cpu/cpu0/cpuidle
# for state in `ls -d state*` ; do echo c-$state `cat $state/name` `cat $state/latency` ; done
c-state0/ POLL 0
c-state1/ C1 3
c-state2/ C1E 10
c-state3/ C2 20
c-state4/ C2E 40
c-state5/ C3 20
c-state6/ C4 60
c-state7/ C4E 100
4
That is nice detective work! I had forgotten how complex the C2D/C2Q C-states were. Re untapped power savings, if your firmware is good enough then you should still be getting the benefit of at least some of the C-states via acpi_idle and the various performance governors. What states does powertop show on your system?
– Stephen Kitt
Jul 12 at 21:07
1
Very nice information, have you considered proposing your patch to the upstream Linux kernel?
– Lekensteyn
Jul 13 at 9:10
1
"The C1E state appears to be used in lieu of C1..." Which state is used - as shown by powertop - is determined solely by the kernel, therefore I believe it will not "have something to do with the MWAIT flags", it will be chosen solely based on the order of the states and the exit_latency and target_residency. That said, I would be slightly concerned about leaving states in the table if they didn't seem to get used when tested... in case those states didn't actually work as expected, and there was some other workload pattern that led to them being used & the unexpected behaviour happening.
– sourcejedi
Jul 15 at 14:13
1
"the transitions are C0 - C1 - C0 - C1E - C0" - I don't think that's a good description of that slide. From the kernel / powertop point of view, all transitions are either from C0 or to C0. If you're not in C0, you're not running any instructions, therefore the kernel cannot either observe or request any transition between states on that cpu :-). And as you say, the kernel "menu" governor may well e.g. jump straight into C1E, without spending any time in C1 first.
– sourcejedi
Jul 15 at 14:29
1
"just taking educated guesses based on the Nehalem idle values seems to work fine" - note this is not a good way to get your patch accepted upstream :-P, in that the exit latency must not be an underestimate, otherwise I think you will violate PM_QOS_CPU_DMA_LATENCY, which may be set by drivers (or userspace?)
– sourcejedi
Jul 15 at 14:41
up vote
22
down vote
accepted
While researching Core 2 CPU power states ("C-states"), I actually managed to implement support for most of the legacy Intel Core/Core 2 processors. The complete implementation (Linux patch) with all of the background information is documented here.
As I accumulated more information about these processors, it started to become apparent that the C-states supported in the Core 2 model(s) are far more complex than those in both earlier and later processors. These are known as Enhanced C-states (or "CxE"), which involve the package, individual cores and other components on the chipset (e.g., memory). At the time the intel_idle
driver was released, the code was not particularly mature and several Core 2 processors had been released that had conflicting C-state support.
Some compelling information on Core 2 Solo/Duo C-state support was found in this article from 2006. This is in relation to support on Windows, however it does indicate the robust hardware C-state support on these processors. The information regarding Kentsfield conflicts with the actual model number, so I believe they are actually referring to a Yorkfield below:
...the quad-core Intel Core 2 Extreme (Kentsfield) processor supports
all five performance and power saving technologies â Enhanced Intel
SpeedStep (EIST), Thermal Monitor 1 (TM1) and Thermal Monitor 2 (TM2),
old On-Demand Clock Modulation (ODCM), as well as Enhanced C States
(CxE). Compared to Intel Pentium 4 and Pentium D 600, 800, and 900
processors, which are characterized only by Enhanced Halt (C1) State,
this function has been expanded in Intel Core 2 processors (as well as
Intel Core Solo/Duo processors) for all possible idle states of a
processor, including Stop Grant (C2), Deep Sleep (C3), and Deeper
Sleep (C4).
This article from 2008 outlines support for per-core C-states on multi-core Intel processors, including Core 2 Duo and Core 2 Quad (additional helpful background reading was found in this white paper from Dell):
A core C-state is a hardware C-state. There are several core idle
states, e.g. CC1 and CC3. As we know, a modern state of the art
processor has multiple cores, such as the recently released Core Duo
T5000/T7000 mobile processors, known as Penryn in some circles. What
we used to think of as a CPU / processor, actually has multiple
general purpose CPUs in side of it. The Intel Core Duo has 2 cores in
the processor chip. The Intel Core-2 Quad has 4 such cores per
processor chip. Each of these cores has its own idle state. This makes
sense as one core might be idle while another is hard at work on a
thread. So a core C-state is the idle state of one of those cores.
I found a 2010 presentation from Intel that provides some additional background about the intel_idle
driver, but unfortunately does not explain the lack of support for Core 2:
This EXPERIMENTAL driver supersedes acpi_idle on Intel Atom
Processors, Intel Core i3/i5/i7 Processors and associated Intel Xeon
processors. It does not support the Intel Core2 processor or earlier.
The above presentation does indicate that the intel_idle
driver is an implementation of the "menu" CPU governor, which has an impact on Linux kernel configuration (i.e., CONFIG_CPU_IDLE_GOV_LADDER
vs. CONFIG_CPU_IDLE_GOV_MENU
). The differences between the ladder and menu governors are succinctly described in this answer.
Dell has a helpful article that lists C-state C0 to C6 compatibility:
Modes C1 to C3 work by basically cutting clock signals used inside the
CPU, while modes C4 to C6 work by reducing the CPU voltage. "Enhanced"
modes can do both at the same time.
Mode Name CPUs
C0 Operating State All CPUs
C1 Halt 486DX4 and above
C1E Enhanced Halt All socket LGA775 CPUs
C1E â Turion 64, 65-nm Athlon X2 and Phenom CPUs
C2 Stop Grant 486DX4 and above
C2 Stop Clock Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III
C2E Extended Stop Grant Core 2 Duo and above (Intel only)
C3 Sleep Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000
C3 Deep Sleep Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64
C3 AltVID AMD Turion 64
C4 Deeper Sleep Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64
C4E/C5 Enhanced Deeper Sleep Core Solo, Core Duo and 45-nm mobile Core 2 Duo only
C6 Deep Power Down 45-nm mobile Core 2 Duo only
From this table (which I later found to be incorrect in some cases), it appears that there were a variety of differences in C-state support with the Core 2 processors (Note that nearly all Core 2 processors are Socket LGA775, except for Core 2 Solo SU3500, which is Socket BGA956 and Merom/Penryn processors. "Intel Core" Solo/Duo processors are one of Socket PBGA479 or PPGA478).
An additional exception to the table was found in this article:
IntelâÂÂs Core 2 Duo E8500 supports C-states C2 and C4, while the Core 2
Extreme QX9650 does not.
Interestingly, the QX9650 is a Yorkfield processor (Intel family 6, model 23, stepping 6). For reference, my Q9550S is Intel family 6, model 23 (0x17), stepping 10, which supposedly supports C-state C4 (confirmed through experimentation). Additionally, the Core 2 Solo U3500 has an identical CPUID (family, model, stepping) to the Q9550S but is available in a non-LGA775 socket, which confounds interpretation of the above table.
Clearly, the CPUID must be used at least down to the stepping in order to identify C-state support for this model of processor, and in some cases that may be insufficient (undetermined at this time).
The method signature for assigning CPU idle information is:
#define ICPU(model, cpu)
X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&cpu
Where model
is enumerated in asm/intel-family.h. Examining this header file, I see that Intel CPUs are assigned 8-bit identifiers that appear to match the Intel family 6 model numbers:
#define INTEL_FAM6_CORE2_PENRYN 0x17
From the above, we have Intel Family 6, Model 23 (0x17) defined as INTEL_FAM6_CORE2_PENRYN
. This should be sufficient for defining idle states for most of the Model 23 processors, but could potentially cause issues with QX9650 as noted above.
So, minimally, each group of processors that has a distinct C-state set would need to be defined in this list.
Zagacki and Ponnala, Intel Technology Journal 12(3):219-227, 2008 indicate that Yorkfield processors do indeed support C2 and C4. They also seem to indicate that the ACPI 3.0a specification supports transitions only between C-states C0, C1, C2 and C3, which I presume may also limit the Linux acpi_idle
driver to transitions between that limited set of C-states. However, this article indicates that may not always be the case:
Bear in mind that is the ACPI C state, not the processor one, so ACPI
C3 might be HW C6, etc.
Also of note:
Beyond the processor itself, since C4 is a synchronized effort between
major silicon components in the platform, the Intel Q45 Express
Chipset achieves a 28-percent power improvement.
The chipset I'm using is indeed an Intel Q45 Express Chipset.
The Intel documentation on MWAIT states is terse but confirms the BIOS-specific ACPI behavior:
The processor-specific C-states defined in MWAIT extensions can map to
ACPI defined C-state types (C0, C1, C2, C3). The mapping relationship
depends on the definition of a C-state by processor implementation and
is exposed to OSPM by the BIOS using the ACPI defined _CST table.
My interpretation of the above table (combined with a table from Wikipedia, asm/intel-family.h and the above articles) is:
Model 9 0x09 (Pentium M and Celeron M):
- Banias: C0, C1, C2, C3, C4
Model 13 0x0D (Pentium M and Celeron M):
- Dothan, Stealey: C0, C1, C2, C3, C4
Model 14 0x0E INTEL_FAM6_CORE_YONAH (Enhanced Pentium M, Enhanced Celeron M or Intel Core):
- Yonah (Core Solo, Core Duo): C0, C1, C2, C3, C4, C4E/C5
Model 15 0x0F INTEL_FAM6_CORE2_MEROM (some Core 2 and Pentium Dual-Core):
- Kentsfield, Merom, Conroe, Allendale (E2xxx/E4xxx and Core 2 Duo E6xxx, T7xxxx/T8xxxx, Core 2 Extreme QX6xxx, Core 2 Quad Q6xxx): C0, C1, C1E, C2, C2E
Model 23 0x17 INTEL_FAM6_CORE2_PENRYN (Core 2):
- Merom-L/Penryn-L: ?
- Penryn (Core 2 Duo 45-nm mobile): C0, C1, C1E, C2, C2E, C3, C4, C4E/C5, C6
- Yorkfield (Core 2 Extreme QX9650): C0, C1, C1E, C2E?, C3
- Wolfdale/Yorkfield (Core 2 Quad, C2Q Xeon, Core 2 Duo E5xxx/E7xxx/E8xxx, Pentium Dual-Core E6xxx, Celeron Dual-Core): C0, C1, C1E, C2, C2E, C3, C4
From the amount of diversity in C-state support within just the Core 2 line of processors, it appears that a lack of consistent support for C-states may have been the reason for not attempting to fully support them via the intel_idle driver. I would like to fully complete the above list for the entire Core 2 line.
This is not really a satisfying answer, because it makes me wonder how much unnecessary power is used and excess heat has been (and still is) generated by not fully utilizing the robust power-saving MWAIT C-states on these processors.
Chattopadhyay et al. 2018, Energy Efficient High Performance Processors: Recent Approaches for Designing Green High Performance Computing is worth noting for the specific behavior I'm looking for in the Q45 Express Chipset:
Package C-state (PC0-PC10) - When the compute domains, Core and
Graphics (GPU) are idle, the processor has an opportunity for
additional power savings at uncore and platform levels, for example,
flushing the LLC and power-gating the memory controller and DRAM IO,
and at some state, the whole processor can be turned off while its
state is preserved on always-on power domain.
As a test, I inserted the following at linux/drivers/idle/intel_idle.c line 127:
/* Model 15 (Merom/Conroe): C1/C1E and C2E only; C2 is left commented out. */
static struct cpuidle_state conroe_cstates[] = {
	{
		.name = "C1",
		.desc = "MWAIT 0x00",
		.flags = MWAIT2flg(0x00),
		.exit_latency = 3,
		.target_residency = 6,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
	{
		.name = "C1E",
		.desc = "MWAIT 0x01",
		.flags = MWAIT2flg(0x01),
		.exit_latency = 10,
		.target_residency = 20,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
//	{
//		.name = "C2",
//		.desc = "MWAIT 0x10",
//		.flags = MWAIT2flg(0x10),
//		.exit_latency = 20,
//		.target_residency = 40,
//		.enter = &intel_idle,
//		.enter_s2idle = intel_idle_s2idle,
//	},
	{
		.name = "C2E",
		.desc = "MWAIT 0x11",
		.flags = MWAIT2flg(0x11),
		.exit_latency = 40,
		.target_residency = 100,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
	{
		.enter = NULL
	}
};

/* Model 23 (Penryn/Wolfdale/Yorkfield): adds C2, C3 and C4. */
static struct cpuidle_state core2_cstates[] = {
	{
		.name = "C1",
		.desc = "MWAIT 0x00",
		.flags = MWAIT2flg(0x00),
		.exit_latency = 3,
		.target_residency = 6,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
	{
		.name = "C1E",
		.desc = "MWAIT 0x01",
		.flags = MWAIT2flg(0x01),
		.exit_latency = 10,
		.target_residency = 20,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
	{
		.name = "C2",
		.desc = "MWAIT 0x10",
		.flags = MWAIT2flg(0x10),
		.exit_latency = 20,
		.target_residency = 40,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
	{
		.name = "C2E",
		.desc = "MWAIT 0x11",
		.flags = MWAIT2flg(0x11),
		.exit_latency = 40,
		.target_residency = 100,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
	{
		.name = "C3",
		.desc = "MWAIT 0x20",
		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
		.exit_latency = 100,
		.target_residency = 400,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
	{
		.name = "C4",
		.desc = "MWAIT 0x30",
		.flags = MWAIT2flg(0x30) | CPUIDLE_FLAG_TLB_FLUSHED,
		.exit_latency = 200,
		.target_residency = 800,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle,
	},
	{
		.enter = NULL
	}
};
at intel_idle.c line 983:
static const struct idle_cpu idle_cpu_conroe = {
	.state_table = conroe_cstates,
	.disable_promotion_to_c1e = false,
};
static const struct idle_cpu idle_cpu_core2 = {
	.state_table = core2_cstates,
	.disable_promotion_to_c1e = false,
};
at intel_idle.c line 1073:
ICPU(INTEL_FAM6_CORE2_MEROM, idle_cpu_conroe),
ICPU(INTEL_FAM6_CORE2_PENRYN, idle_cpu_core2),
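The effect of the two ICPU() entries can be pictured as a table lookup keyed on the family 6 model number. The following is only a userspace sketch of that idea, not the kernel's matching code: the enum, struct and function names are hypothetical, and only the two model numbers (0x0F and 0x17) come from the text above.

```c
#include <stddef.h>

/* Family 6 model numbers as quoted from asm/intel-family.h in this answer. */
#define MODEL_CORE2_MEROM  0x0F
#define MODEL_CORE2_PENRYN 0x17

/* Hypothetical identifiers for the two state tables added above. */
enum idle_table {
	IDLE_TABLE_NONE = 0,	/* no entry: the driver would not load */
	IDLE_TABLE_CONROE,	/* stands in for conroe_cstates */
	IDLE_TABLE_CORE2,	/* stands in for core2_cstates */
};

/* Hypothetical userspace stand-in for one ICPU() entry. */
struct idle_match {
	unsigned int model;
	enum idle_table table;
};

static const struct idle_match idle_matches[] = {
	{ MODEL_CORE2_MEROM,  IDLE_TABLE_CONROE },
	{ MODEL_CORE2_PENRYN, IDLE_TABLE_CORE2  },
};

/* Return the table for a given family/model, or IDLE_TABLE_NONE. */
enum idle_table idle_table_for_model(unsigned int family, unsigned int model)
{
	size_t i;

	if (family != 6)
		return IDLE_TABLE_NONE;
	for (i = 0; i < sizeof(idle_matches) / sizeof(idle_matches[0]); i++) {
		if (idle_matches[i].model == model)
			return idle_matches[i].table;
	}
	return IDLE_TABLE_NONE;
}
```

Because the lookup is keyed on the model alone, every model 23 part (including the QX9650 and the Core 2 Solo chips discussed in this answer) would receive the same table, which is why stepping-level differences cannot be expressed this way.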
After a quick compile and reboot of my PXE nodes, dmesg now shows:
[ 0.019845] cpuidle: using governor menu
[ 0.515785] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.543404] intel_idle: MWAIT substates: 0x22220
[ 0.543405] intel_idle: v0.4.1 model 0x17
[ 0.543413] tsc: Marking TSC unstable due to TSC halts in idle states deeper than C2
[ 0.543680] intel_idle: lapic_timer_reliable_states 0x2
And now PowerTOP is showing:
Package | CPU 0
POLL 2.5% | POLL 0.0% 0.0 ms
C1E 2.9% | C1E 5.0% 22.4 ms
C2 0.4% | C2 0.2% 0.2 ms
C3 2.1% | C3 1.9% 0.5 ms
C4E 89.9% | C4E 92.6% 66.5 ms
| CPU 1
| POLL 10.0% 400.8 ms
| C1E 5.1% 6.4 ms
| C2 0.3% 0.1 ms
| C3 1.4% 0.6 ms
| C4E 76.8% 73.6 ms
| CPU 2
| POLL 0.0% 0.2 ms
| C1E 1.1% 3.7 ms
| C2 0.2% 0.2 ms
| C3 3.9% 1.3 ms
| C4E 93.1% 26.4 ms
| CPU 3
| POLL 0.0% 0.7 ms
| C1E 0.3% 0.3 ms
| C2 1.1% 0.4 ms
| C3 1.1% 0.5 ms
| C4E 97.0% 45.2 ms
I've finally accessed the Enhanced Core 2 C-states, and now that my cluster is rebooted the fan noise has dropped to almost nothing in this room. And it looks like there is a measurable drop in power consumption - my meter on 8 nodes appears to be averaging at least 5% lower (with one node still running the old kernel), but I'll try swapping the kernels out again as a test.
An interesting note regarding C4E support - My Yorkfield Q9550S processor appears to support it (or some other sub-state of C4), as evidenced above! This confuses me, because the Intel datasheet on the Core 2 Q9000 processor (section 6.2) only mentions C-states Normal (C0), HALT (C1 = 0x00), Extended HALT (C1E = 0x01), Stop Grant (C2 = 0x10), Extended Stop Grant (C2E = 0x11), Sleep/Deep Sleep (C3 = 0x20) and Deeper Sleep (C4 = 0x30). What is this additional 0x31 state? If I enable state C2, then C4E is used instead of C4. If I disable state C2 (force state C2E) then C4 is used instead of C4E. I suspect this may have something to do with the MWAIT flags, but I haven't yet found documentation for this behavior.
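One possible reading of the 0x31 hint, offered as a sketch rather than a confirmed explanation: an MWAIT hint carries the C-state class in bits 7:4 and the sub-state in bits 3:0, and the "MWAIT substates" value printed in dmesg is the CPUID leaf 5 EDX register, where each 4-bit field gives the number of sub-states for one class (the lowest field describes C0). The function names below are hypothetical, and the "advertised" check is my own interpretation, not something taken from the driver.

```c
#include <stdint.h>

/* Bits 7:4 of an MWAIT hint select the C-state class. */
unsigned int mwait_hint_cstate(unsigned int hint)
{
	return (hint >> 4) & 0xF;
}

/* Bits 3:0 of an MWAIT hint select the sub-state within that class. */
unsigned int mwait_hint_substate(unsigned int hint)
{
	return hint & 0xF;
}

/*
 * Number of sub-states that CPUID.05H:EDX reports for the class a hint
 * selects. The lowest nibble of EDX describes C0, so hint class n is
 * looked up in nibble n + 1.
 */
unsigned int mwait_num_substates(uint32_t edx, unsigned int hint)
{
	unsigned int shift = (mwait_hint_cstate(hint) + 1) * 4;

	if (shift >= 32)	/* beyond the fields a 32-bit EDX can hold */
		return 0;
	return (edx >> shift) & 0xF;
}

/* Assumption: a hint is usable when its sub-state index is below the count. */
int mwait_hint_advertised(uint32_t edx, unsigned int hint)
{
	return mwait_hint_substate(hint) < mwait_num_substates(edx, hint);
}
```

Under that reading, the 0x22220 value from dmesg reports two sub-states for each of the hint classes 0 through 3, so a 0x31 hint would exist alongside 0x30 in the same way that 0x01 sits alongside 0x00. That matches the C4/C4E pairing seen in PowerTOP, but it does not by itself explain why the datasheet omits the state.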
I'm not certain what to make of this: The C1E state appears to be used in lieu of C1, C2 is used in lieu of C2E and C4E is used in lieu of C4. I'm uncertain if C1/C1E, C2/C2E and C4/C4E can be used together with intel_idle or if they are redundant. I found a note in this 2010 presentation by Intel Labs Pittsburgh that indicates the transitions are C0 - C1 - C0 - C1E - C0, and further states:
C1E is only used when all the cores are in C1E
I believe that is to be interpreted as the C1E state is entered on other components (e.g. memory) only when all cores are in the C1E state. I also take this to apply equivalently to the C2/C2E and C4/C4E states (Although C4E is referred to as "C4E/C5" so I'm uncertain if C4E is a sub-state of C4 or if C5 is a sub-state of C4E. Testing seems to indicate C4/C4E is correct). I can force C2E to be used by commenting out the C2 state - however, this causes the C4 state to be used instead of C4E (more work may be required here). Hopefully there aren't any model 15 or model 23 processors that lack state C2E, because those processors would be limited to C1/C1E with the above code.
Also, the flags, latency and residency values could probably stand to be fine-tuned, but just taking educated guesses based on the Nehalem idle values seems to work fine. More reading will be required to make any improvements.
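To see why the latency and residency values matter, here is a deliberately simplified sketch of how a cpuidle governor could use them: take the deepest state whose target residency fits the predicted idle time and whose exit latency stays within the current latency limit. This is not the menu governor's actual algorithm (which also predicts the idle duration and applies corrections). The struct and function names are hypothetical, and the numbers are the guessed values from the core2_cstates table, in microseconds.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down view of one cpuidle state (times in us). */
struct idle_state {
	const char *name;
	unsigned int exit_latency_us;
	unsigned int target_residency_us;
};

/* Values copied from the core2_cstates table in this answer. */
static const struct idle_state core2_states[] = {
	{ "C1",    3,   6 },
	{ "C1E",  10,  20 },
	{ "C2",   20,  40 },
	{ "C2E",  40, 100 },
	{ "C3",  100, 400 },
	{ "C4",  200, 800 },
};

#define CORE2_STATE_COUNT (sizeof(core2_states) / sizeof(core2_states[0]))

/*
 * Return the index of the deepest state that fits both constraints,
 * or -1 if none fits (the CPU would simply poll).
 */
int pick_idle_state(const struct idle_state *states, size_t count,
		    unsigned int predicted_idle_us,
		    unsigned int latency_limit_us)
{
	int chosen = -1;
	size_t i;

	for (i = 0; i < count; i++) {
		if (states[i].target_residency_us > predicted_idle_us)
			continue;	/* idle period too short to pay off */
		if (states[i].exit_latency_us > latency_limit_us)
			continue;	/* wake-up would take too long */
		chosen = (int)i;
	}
	return chosen;
}
```

With these numbers, a predicted idle period of 50 us selects C2, while a 1000 us idle period under a 50 us latency limit stops at C2E. An exit latency that underestimates the hardware would let the second check pass when it should not, which is the concern with guessed values.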
I tested this on a Core 2 Duo E2220 (Allendale), a Dual Core Pentium E5300 (Wolfdale), Core 2 Duo E7400, Core 2 Duo E8400 (Wolfdale), Core 2 Quad Q9550S (Yorkfield) and Core 2 Extreme QX9650, and I have found no issues beyond the aforementioned preference for state C2/C2E and C4/C4E.
Not covered by this driver modification:
- The original Core Solo/Core Duo (Yonah, non Core 2) are family 6, model 14. This is good because they supported the C4E/C5 (Enhanced Deep Sleep) C-states but not the C1E/C2E states and would need their own idle definition.
The only issues that I can think of are:
- Core 2 Solo SU3300/SU3500 (Penryn-L) are family 6, model 23 and will be detected by this driver. However, they are not Socket LGA775 so they may not support the C1E Enhanced Halt C-state. Likewise for the Core 2 Solo ULV U2100/U2200 (Merom-L). However, the intel_idle driver appears to choose the appropriate C1/C1E based on hardware support of the sub-states.
- Core 2 Extreme QX9650 (Yorkfield) reportedly does not support C-state C2 or C4. I have confirmed this by purchasing a used Optiplex 780 and QX9650 Extreme processor on eBay. The processor supports C-states C1 and C1E. With this driver modification, the CPU idles in state C1E instead of C1, so there is presumably some power savings. I expected to see C-state C3, but it is not present when using this driver so I may need to look into this further.
I managed to find a slide from a 2009 Intel presentation on the transitions between C-states (i.e., Deep Power Down):
In conclusion, it turns out that there was no real reason for the lack of Core 2 support in the intel_idle driver. It is clear now that the original stub code for "Core 2 Duo" only handled C-states C1 and C2, which would have been far less efficient than the acpi_idle function which also handles C-state C3. Once I knew where to look, implementing support was easy. The helpful comments and other answers were much appreciated, and if Amazon is listening, you know where to send the check.
This update has been committed to GitHub. I will e-mail a patch to the LKML soon.
Update: I also managed to dig up a Socket T/LGA775 Allendale (Conroe) Core 2 Duo E2220, which is family 6, model 15, so I added support for that as well. This model lacks support for C-state C4, but supports C1/C1E and C2/C2E. This should also work for other Conroe-based chips (E4xxx/E6xxx) and possibly all Kentsfield and Merom (non Merom-L) processors.
Update: I finally found some MWAIT tuning resources. This Power vs. Performance writeup and this Deeper C states and increased latency blog post both contain some useful information on identifying CPU idle latencies. Unfortunately, sysfs only reports those exit latencies that were coded into the kernel (but, interestingly, only for those hardware states supported by the processor):
# cd /sys/devices/system/cpu/cpu0/cpuidle
# for state in `ls -d state*` ; do echo c-$state `cat $state/name` `cat $state/latency` ; done
c-state0/ POLL 0
c-state1/ C1 3
c-state2/ C1E 10
c-state3/ C2 20
c-state4/ C2E 40
c-state5/ C3 20
c-state6/ C4 60
c-state7/ C4E 100
4
That is nice detective work! I had forgotten how complex the C2D/C2Q C-states were. Re untapped power savings, if your firmware is good enough then you should still be getting the benefit of at least some of the C-states via acpi_idle and the various performance governors. What states does powertop show on your system?
– Stephen Kitt
Jul 12 at 21:07
1
Very nice information, have you considered proposing your patch to the upstream Linux kernel?
– Lekensteyn
Jul 13 at 9:10
1
"The C1E state appears to be used in lieu of C1..." Which state is used - as shown by powertop - is determined solely by the kernel, therefore I believe it will not "have something to do with the MWAIT flags", it will be chosen solely based on the order of the states and the exit_latency and target_residency. That said, I would be slightly concerned about leaving states in the table if they didn't seem to get used when tested... in case those states didn't actually work as expected, and there was some other workload pattern that led to them being used & the unexpected behaviour happening.
– sourcejedi
Jul 15 at 14:13
1
"the transitions are C0 - C1 - C0 - C1E - C0" - I don't think that's a good description of that slide. From the kernel / powertop point of view, all transitions are either from C0 or to C0. If you're not in C0, you're not running any instructions, therefore the kernel cannot either observe or request any transition between states on that cpu :-). And as you say, the kernel "menu" governor may well e.g. jump straight into C1E, without spending any time in C1 first.
– sourcejedi
Jul 15 at 14:29
1
"just taking educated guesses based on the Nehalem idle values seems to work fine" - note this is not a good way to get your patch accepted upstream :-P, in that the exit latency must not be an underestimate, otherwise I think you will violate PM_QOS_CPU_DMA_LATENCY, which may be set by drivers (or userspace?)
– sourcejedi
Jul 15 at 14:41
While researching Core 2 CPU power states ("C-states"), I actually managed to implement support for most of the legacy Intel Core/Core 2 processors. The complete implementation (Linux patch) with all of the background information is documented here.
As I accumulated more information about these processors, it started to become apparent that the C-states supported in the Core 2 model(s) are far more complex than those in both earlier and later processors. These are known as Enhanced C-states (or "CxE"), which involve the package, individual cores and other components on the chipset (e.g., memory). At the time the intel_idle
driver was released, the code was not particularly mature and several Core 2 processors had been released that had conflicting C-state support.
Some compelling information on Core 2 Solo/Duo C-state support was found in this article from 2006. This is in relation to support on Windows, however it does indicate the robust hardware C-state support on these processors. The information regarding Kentsfield conflicts with the actual model number, so I believe they are actually referring to a Yorkfield below:
...the quad-core Intel Core 2 Extreme (Kentsfield) processor supports
all five performance and power saving technologies â Enhanced Intel
SpeedStep (EIST), Thermal Monitor 1 (TM1) and Thermal Monitor 2 (TM2),
old On-Demand Clock Modulation (ODCM), as well as Enhanced C States
(CxE). Compared to Intel Pentium 4 and Pentium D 600, 800, and 900
processors, which are characterized only by Enhanced Halt (C1) State,
this function has been expanded in Intel Core 2 processors (as well as
Intel Core Solo/Duo processors) for all possible idle states of a
processor, including Stop Grant (C2), Deep Sleep (C3), and Deeper
Sleep (C4).
This article from 2008 outlines support for per-core C-states on multi-core Intel processors, including Core 2 Duo and Core 2 Quad (additional helpful background reading was found in this white paper from Dell):
A core C-state is a hardware C-state. There are several core idle
states, e.g. CC1 and CC3. As we know, a modern state of the art
processor has multiple cores, such as the recently released Core Duo
T5000/T7000 mobile processors, known as Penryn in some circles. What
we used to think of as a CPU / processor, actually has multiple
general purpose CPUs in side of it. The Intel Core Duo has 2 cores in
the processor chip. The Intel Core-2 Quad has 4 such cores per
processor chip. Each of these cores has its own idle state. This makes
sense as one core might be idle while another is hard at work on a
thread. So a core C-state is the idle state of one of those cores.
I found a 2010 presentation from Intel that provides some additional background about the intel_idle driver, but unfortunately does not explain the lack of support for Core 2:
This EXPERIMENTAL driver supersedes acpi_idle on Intel Atom
Processors, Intel Core i3/i5/i7 Processors and associated Intel Xeon
processors. It does not support the Intel Core2 processor or earlier.
The above presentation does indicate that the intel_idle driver is an implementation of the "menu" CPU governor, which has an impact on Linux kernel configuration (i.e., CONFIG_CPU_IDLE_GOV_LADDER vs. CONFIG_CPU_IDLE_GOV_MENU). The differences between the ladder and menu governors are succinctly described in this answer.
Dell has a helpful article that lists C-state C0 to C6 compatibility:
Modes C1 to C3 work by basically cutting clock signals used inside the
CPU, while modes C4 to C6 work by reducing the CPU voltage. "Enhanced"
modes can do both at the same time.
Mode    Name                   CPUs
C0      Operating State        All CPUs
C1      Halt                   486DX4 and above
C1E     Enhanced Halt          All socket LGA775 CPUs
C1E     -                      Turion 64, 65-nm Athlon X2 and Phenom CPUs
C2      Stop Grant             486DX4 and above
C2      Stop Clock             Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III
C2E     Extended Stop Grant    Core 2 Duo and above (Intel only)
C3      Sleep                  Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000
C3      Deep Sleep             Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64
C3      AltVID                 AMD Turion 64
C4      Deeper Sleep           Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64
C4E/C5  Enhanced Deeper Sleep  Core Solo, Core Duo and 45-nm mobile Core 2 Duo only
C6      Deep Power Down        45-nm mobile Core 2 Duo only
From this table (which I later found to be incorrect in some cases), it appears that there were a variety of differences in C-state support with the Core 2 processors (Note that nearly all Core 2 processors are Socket LGA775, except for Core 2 Solo SU3500, which is Socket BGA956 and Merom/Penryn processors. "Intel Core" Solo/Duo processors are one of Socket PBGA479 or PPGA478).
An additional exception to the table was found in this article:
Intel's Core 2 Duo E8500 supports C-states C2 and C4, while the Core 2
Extreme QX9650 does not.
Interestingly, the QX9650 is a Yorkfield processor (Intel family 6, model 23, stepping 6). For reference, my Q9550S is Intel family 6, model 23 (0x17), stepping 10, which supposedly supports C-state C4 (confirmed through experimentation). Additionally, the Core 2 Solo U3500 has an identical CPUID (family, model, stepping) to the Q9550S but is available in a non-LGA775 socket, which confounds interpretation of the above table.
Clearly, the CPUID must be used at least down to the stepping in order to identify C-state support for this model of processor, and in some cases that may be insufficient (undetermined at this time).
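To make the family/model/stepping terminology concrete, here is a minimal userspace sketch of how those three fields are extracted from a raw CPUID leaf 1 EAX value. The struct and function names are hypothetical (they are not kernel identifiers), and the sample value 0x0001067A is used only because it decodes to family 6, model 23, stepping 10.

```c
#include <stdint.h>

/* Family/model/stepping fields of a CPUID leaf 1 EAX signature. */
struct cpu_sig {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/*
 * Decode a raw CPUID.01H:EAX value (hypothetical helper).
 * For base family 6 or 15, the extended model bits (19:16) become the
 * high nibble of the model number; for base family 15, the extended
 * family bits (27:20) are added to the family.
 */
static struct cpu_sig decode_cpu_sig(uint32_t eax)
{
	struct cpu_sig sig;
	unsigned int base_family = (eax >> 8) & 0xF;
	unsigned int base_model = (eax >> 4) & 0xF;
	unsigned int ext_model = (eax >> 16) & 0xF;
	unsigned int ext_family = (eax >> 20) & 0xFF;

	sig.stepping = eax & 0xF;
	sig.family = base_family;
	sig.model = base_model;

	if (base_family == 0xF)
		sig.family += ext_family;
	if (base_family == 0x6 || base_family == 0xF)
		sig.model |= ext_model << 4;

	return sig;
}
```

Calling decode_cpu_sig(0x0001067A) yields family 6, model 23 (0x17), stepping 10, while 0x00010676 differs only in the stepping field (6). A table keyed on the model alone cannot tell those two apart.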
The macro used for assigning CPU idle information to a model is:
#define ICPU(model, cpu) \
	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&cpu }
Where model is enumerated in asm/intel-family.h. Examining this header file, I see that Intel CPUs are assigned 8-bit identifiers that appear to match the Intel family 6 model numbers:
#define INTEL_FAM6_CORE2_PENRYN 0x17
From the above, we have Intel Family 6, Model 23 (0x17) defined as INTEL_FAM6_CORE2_PENRYN. This should be sufficient for defining idle states for most of the Model 23 processors, but could potentially cause issues with QX9650 as noted above.
So, minimally, each group of processors that has a distinct C-state set would need to be defined in this list.
Zagacki and Ponnala, Intel Technology Journal 12(3):219-227, 2008 indicate that Yorkfield processors do indeed support C2 and C4. They also seem to indicate that the ACPI 3.0a specification supports transitions only between C-states C0, C1, C2 and C3, which I presume may also limit the Linux acpi_idle driver to transitions between that limited set of C-states. However, this article indicates that may not always be the case:
Bear in mind that is the ACPI C state, not the processor one, so ACPI
C3 might be HW C6, etc.
Also of note:
Beyond the processor itself, since C4 is a synchronized effort between
major silicon components in the platform, the Intel Q45 Express
Chipset achieves a 28-percent power improvement.
The chipset I'm using is indeed an Intel Q45 Express Chipset.
The Intel documentation on MWAIT states is terse but confirms the BIOS-specific ACPI behavior:
The processor-specific C-states defined in MWAIT extensions can map to
ACPI defined C-state types (C0, C1, C2, C3). The mapping relationship
depends on the definition of a C-state by processor implementation and
is exposed to OSPM by the BIOS using the ACPI defined _CST table.
My interpretation of the above table (combined with a table from Wikipedia, asm/intel-family.h and the above articles) is:
Model 9 0x09 (Pentium M and Celeron M):
- Banias: C0, C1, C2, C3, C4
Model 13 0x0D (Pentium M and Celeron M):
- Dothan, Stealey: C0, C1, C2, C3, C4
Model 14 0x0E INTEL_FAM6_CORE_YONAH (Enhanced Pentium M, Enhanced Celeron M or Intel Core):
- Yonah (Core Solo, Core Duo): C0, C1, C2, C3, C4, C4E/C5
Model 15 0x0F INTEL_FAM6_CORE2_MEROM (some Core 2 and Pentium Dual-Core):
- Kentsfield, Merom, Conroe, Allendale (E2xxx/E4xxx and Core 2 Duo E6xxx, T7xxxx/T8xxxx, Core 2 Extreme QX6xxx, Core 2 Quad Q6xxx): C0, C1, C1E, C2, C2E
Model 23 0x17 INTEL_FAM6_CORE2_PENRYN (Core 2):
- Merom-L/Penryn-L: ?
- Penryn (Core 2 Duo 45-nm mobile): C0, C1, C1E, C2, C2E, C3, C4, C4E/C5, C6
- Yorkfield (Core 2 Extreme QX9650): C0, C1, C1E, C2E?, C3
- Wolfdale/Yorkfield (Core 2 Quad, C2Q Xeon, Core 2 Duo E5xxx/E7xxx/E8xxx, Pentium Dual-Core E6xxx, Celeron Dual-Core): C0, C1, C1E, C2, C2E, C3, C4
From the amount of diversity in C-state support within just the Core 2 line of processors, it appears that a lack of consistent support for C-states may have been the reason for not attempting to fully support them via the intel_idle driver. I would like to fully complete the above list for the entire Core 2 line.
This is not really a satisfying answer, because it makes me wonder how much unnecessary power is used and excess heat has been (and still is) generated by not fully utilizing the robust power-saving MWAIT C-states on these processors.
Chattopadhyay et al. 2018, Energy Efficient High Performance Processors: Recent Approaches for Designing Green High Performance Computing is worth noting for the specific behavior I'm looking for in the Q45 Express Chipset:
Package C-state (PC0-PC10) - When the compute domains, Core and
Graphics (GPU) are idle, the processor has an opportunity for
additional power savings at uncore and platform levels, for example,
flushing the LLC and power-gating the memory controller and DRAM IO,
and at some state, the whole processor can be turned off while its
state is preserved on always-on power domain.
As a test, I inserted the following at linux/drivers/idle/intel_idle.c line 127:
/* Model 15 (Conroe/Merom): C1, C1E and C2E; C2 is left commented out */
static struct cpuidle_state conroe_cstates[] = {
	{
		.name = "C1",
		.desc = "MWAIT 0x00",
		.flags = MWAIT2flg(0x00),
		.exit_latency = 3,
		.target_residency = 6,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C1E",
		.desc = "MWAIT 0x01",
		.flags = MWAIT2flg(0x01),
		.exit_latency = 10,
		.target_residency = 20,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
//	{
//		.name = "C2",
//		.desc = "MWAIT 0x10",
//		.flags = MWAIT2flg(0x10),
//		.exit_latency = 20,
//		.target_residency = 40,
//		.enter = &intel_idle,
//		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C2E",
		.desc = "MWAIT 0x11",
		.flags = MWAIT2flg(0x11),
		.exit_latency = 40,
		.target_residency = 100,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.enter = NULL }
};
/* Model 23 (Penryn/Wolfdale/Yorkfield): C1 through C4 */
static struct cpuidle_state core2_cstates[] = {
	{
		.name = "C1",
		.desc = "MWAIT 0x00",
		.flags = MWAIT2flg(0x00),
		.exit_latency = 3,
		.target_residency = 6,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C1E",
		.desc = "MWAIT 0x01",
		.flags = MWAIT2flg(0x01),
		.exit_latency = 10,
		.target_residency = 20,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C2",
		.desc = "MWAIT 0x10",
		.flags = MWAIT2flg(0x10),
		.exit_latency = 20,
		.target_residency = 40,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C2E",
		.desc = "MWAIT 0x11",
		.flags = MWAIT2flg(0x11),
		.exit_latency = 40,
		.target_residency = 100,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C3",
		.desc = "MWAIT 0x20",
		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
		.exit_latency = 100,
		.target_residency = 400,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.name = "C4",
		.desc = "MWAIT 0x30",
		.flags = MWAIT2flg(0x30) | CPUIDLE_FLAG_TLB_FLUSHED,
		.exit_latency = 200,
		.target_residency = 800,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },
	{
		.enter = NULL }
};
at intel_idle.c line 983:
static const struct idle_cpu idle_cpu_conroe = {
	.state_table = conroe_cstates,
	.disable_promotion_to_c1e = false,
};
static const struct idle_cpu idle_cpu_core2 = {
	.state_table = core2_cstates,
	.disable_promotion_to_c1e = false,
};
at intel_idle.c line 1073:
ICPU(INTEL_FAM6_CORE2_MEROM, idle_cpu_conroe),
ICPU(INTEL_FAM6_CORE2_PENRYN, idle_cpu_core2),
After a quick compile and reboot of my PXE nodes, dmesg now shows:
[ 0.019845] cpuidle: using governor menu
[ 0.515785] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.543404] intel_idle: MWAIT substates: 0x22220
[ 0.543405] intel_idle: v0.4.1 model 0x17
[ 0.543413] tsc: Marking TSC unstable due to TSC halts in idle states deeper than C2
[ 0.543680] intel_idle: lapic_timer_reliable_states 0x2
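The "MWAIT substates" value in that log is the EDX output of CPUID leaf 5, in which each 4-bit field holds the number of MWAIT sub-states the processor advertises for one C-state (C0 in the lowest nibble, C1 in the next, and so on). The following is a small userspace sketch of that decoding; the function name is hypothetical and not taken from the driver.

```c
#include <stdint.h>

/*
 * Number of MWAIT sub-states advertised for hardware C-state `cstate`
 * (0 = C0, 1 = C1, ...). Each C-state owns one 4-bit field of the
 * CPUID leaf 5 EDX value. Hypothetical helper for illustration only.
 */
static unsigned int mwait_num_substates(uint32_t substates, unsigned int cstate)
{
	if (cstate > 7)
		return 0;	/* only eight 4-bit fields fit in 32 bits */
	return (substates >> (cstate * 4)) & 0xF;
}
```

Decoded this way, 0x22220 reports no sub-states for C0 and two sub-states each for C1, C2, C3 and C4, with nothing advertised beyond C4.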
And now PowerTOP is showing:
Package | CPU 0
POLL 2.5% | POLL 0.0% 0.0 ms
C1E 2.9% | C1E 5.0% 22.4 ms
C2 0.4% | C2 0.2% 0.2 ms
C3 2.1% | C3 1.9% 0.5 ms
C4E 89.9% | C4E 92.6% 66.5 ms
| CPU 1
| POLL 10.0% 400.8 ms
| C1E 5.1% 6.4 ms
| C2 0.3% 0.1 ms
| C3 1.4% 0.6 ms
| C4E 76.8% 73.6 ms
| CPU 2
| POLL 0.0% 0.2 ms
| C1E 1.1% 3.7 ms
| C2 0.2% 0.2 ms
| C3 3.9% 1.3 ms
| C4E 93.1% 26.4 ms
| CPU 3
| POLL 0.0% 0.7 ms
| C1E 0.3% 0.3 ms
| C2 1.1% 0.4 ms
| C3 1.1% 0.5 ms
| C4E 97.0% 45.2 ms
I've finally accessed the Enhanced Core 2 C-states, and now that my cluster is rebooted the fan noise has dropped to almost nothing in this room. And it looks like there is a measurable drop in power consumption - my meter on 8 nodes appears to be averaging at least 5% lower (with one node still running the old kernel), but I'll try swapping the kernels out again as a test.
An interesting note regarding C4E support - My Yorkfield Q9550S processor appears to support it (or some other sub-state of C4), as evidenced above! This confuses me, because the Intel datasheet on the Core 2 Q9000 processor (section 6.2) only mentions C-states Normal (C0), HALT (C1 = 0x00), Extended HALT (C1E = 0x01), Stop Grant (C2 = 0x10), Extended Stop Grant (C2E = 0x11), Sleep/Deep Sleep (C3 = 0x20) and Deeper Sleep (C4 = 0x30). What is this additional 0x31 state? If I enable state C2, then C4E is used instead of C4. If I disable state C2 (force state C2E) then C4 is used instead of C4E. I suspect this may have something to do with the MWAIT flags, but I haven't yet found documentation for this behavior.
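The hint values listed above follow a regular pattern: the upper nibble selects the C-state (0 for C1, 1 for C2, and so on) and the lower nibble selects the sub-state. The sketch below illustrates that split and the way a hint is packed into the top byte of the .flags field. The two macros are modeled on the kernel's MWAIT2flg()/flg2MWAIT() but are written here from memory as an assumption, and the helper functions are hypothetical names.

```c
#include <stdint.h>

/* Pack an 8-bit MWAIT hint into bits 31:24 of a flags word, and back.
 * Modeled on the driver's MWAIT2flg()/flg2MWAIT(); treat as an assumption. */
#define MWAIT2FLG(eax)   ((uint32_t)((eax) & 0xFF) << 24)
#define FLG2MWAIT(flags) (((flags) >> 24) & 0xFF)

/* Upper nibble of the hint, plus one, gives the C-state number (0x30 -> C4). */
static unsigned int hint_to_cstate(uint8_t hint)
{
	return ((hint >> 4) & 0xF) + 1;
}

/* Lower nibble of the hint gives the sub-state (0x01 -> 1, the "E" variant). */
static unsigned int hint_to_substate(uint8_t hint)
{
	return hint & 0xF;
}
```

Under this reading, 0x31 would be sub-state 1 of C4, by analogy with C1E (0x01) and C2E (0x11). Whether the hardware actually treats it as C4E is not something the datasheet quoted above confirms.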
I'm not certain what to make of this: The C1E state appears to be used in lieu of C1, C2 is used in lieu of C2E and C4E is used in lieu of C4. I'm uncertain if C1/C1E, C2/C2E and C4/C4E can be used together with intel_idle or if they are redundant. I found a note in this 2010 presentation by Intel Labs Pittsburgh that indicates the transitions are C0 - C1 - C0 - C1E - C0, and further states:
C1E is only used when all the cores are in C1E
I believe that is to be interpreted as the C1E state is entered on other components (e.g. memory) only when all cores are in the C1E state. I also take this to apply equivalently to the C2/C2E and C4/C4E states (Although C4E is referred to as "C4E/C5" so I'm uncertain if C4E is a sub-state of C4 or if C5 is a sub-state of C4E. Testing seems to indicate C4/C4E is correct). I can force C2E to be used by commenting out the C2 state - however, this causes the C4 state to be used instead of C4E (more work may be required here). Hopefully there aren't any model 15 or model 23 processors that lack state C2E, because those processors would be limited to C1/C1E with the above code.
Also, the flags, latency and residency values could probably stand to be fine-tuned, but just taking educated guesses based on the Nehalem idle values seems to work fine. More reading will be required to make any improvements.
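To see why the latency and residency numbers matter, here is a deliberately simplified sketch of how an idle governor can use them: pick the deepest state whose target residency fits the predicted idle period and whose exit latency stays within a latency limit. This is not the menu governor's actual code; the struct, the function name and the selection rule are reduced stand-ins, and the table values are copied from the core2_cstates definition above.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct cpuidle_state (microseconds). */
struct idle_state {
	const char *name;
	unsigned int exit_latency;
	unsigned int target_residency;
};

/* Values taken from the core2_cstates table in this answer. */
static const struct idle_state core2_states[] = {
	{ "C1",    3,   6 },
	{ "C1E",  10,  20 },
	{ "C2",   20,  40 },
	{ "C2E",  40, 100 },
	{ "C3",  100, 400 },
	{ "C4",  200, 800 },
};

/*
 * Return the index of the deepest state whose target residency fits the
 * predicted idle time and whose exit latency respects the latency limit.
 * Falls back to index 0 (the shallowest state) when nothing qualifies.
 */
static size_t pick_state(const struct idle_state *states, size_t count,
			 unsigned int predicted_us,
			 unsigned int latency_limit_us)
{
	size_t best = 0;

	for (size_t i = 0; i < count; i++) {
		if (states[i].target_residency > predicted_us)
			continue;	/* would not stay idle long enough */
		if (states[i].exit_latency > latency_limit_us)
			continue;	/* would wake up too slowly */
		best = i;
	}
	return best;
}
```

With these values, a predicted idle period of 50 us selects C2, and a long idle period under a 50 us latency limit stops at C2E. An exit latency that understates the real hardware latency would let a state slip past such a limit, which is why guessed values are risky.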
I tested this on a Core 2 Duo E2220 (Allendale), a Dual Core Pentium E5300 (Wolfdale), Core 2 Duo E7400, Core 2 Duo E8400 (Wolfdale), Core 2 Quad Q9550S (Yorkfield) and Core 2 Extreme QX9650, and I have found no issues beyond the aforementioned preference for state C2/C2E and C4/C4E.
Not covered by this driver modification:
- The original Core Solo/Core Duo (Yonah, non Core 2) are family 6, model 14. This is good because they supported the C4E/C5 (Enhanced Deeper Sleep) C-states but not the C1E/C2E states and would need their own idle definition.
The only issues that I can think of are:
- Core 2 Solo SU3300/SU3500 (Penryn-L) are family 6, model 23 and will be detected by this driver. However, they are not Socket LGA775 so they may not support the C1E Enhanced Halt C-state. Likewise for the Core 2 Solo ULV U2100/U2200 (Merom-L). However, the intel_idle driver appears to choose the appropriate C1/C1E based on hardware support of the sub-states.
- Core 2 Extreme QX9650 (Yorkfield) reportedly does not support C-state C2 or C4. I have confirmed this by purchasing a used Optiplex 780 and QX9650 Extreme processor on eBay. The processor supports C-states C1 and C1E. With this driver modification, the CPU idles in state C1E instead of C1, so there is presumably some power savings. I expected to see C-state C3, but it is not present when using this driver so I may need to look into this further.
I managed to find a slide from a 2009 Intel presentation on the transitions between C-states (i.e., Deep Power Down):
In conclusion, it turns out that there was no real reason for the lack of Core 2 support in the intel_idle driver. It is clear now that the original stub code for "Core 2 Duo" only handled C-states C1 and C2, which would have been far less efficient than the acpi_idle function which also handles C-state C3. Once I knew where to look, implementing support was easy. The helpful comments and other answers were much appreciated, and if Amazon is listening, you know where to send the check.
This update has been committed to github. I will e-mail a patch to the LKML soon.
Update: I also managed to dig up a Socket T/LGA775 Allendale (Conroe) Core 2 Duo E2220, which is family 6, model 15, so I added support for that as well. This model lacks support for C-state C4, but supports C1/C1E and C2/C2E. This should also work for other Conroe-based chips (E4xxx/E6xxx) and possibly all Kentsfield and Merom (non Merom-L) processors.
Update: I finally found some MWAIT tuning resources. This Power vs. Performance writeup and this Deeper C states and increased latency blog post both contain some useful information on identifying CPU idle latencies. Unfortunately, this only reports those exit latencies that were coded into the kernel (but, interestingly, only those hardware states supported by the processor):
# cd /sys/devices/system/cpu/cpu0/cpuidle
# for state in `ls -d state*` ; do echo c-$state `cat $state/name` `cat $state/latency` ; done
c-state0/ POLL 0
c-state1/ C1 3
c-state2/ C1E 10
c-state3/ C2 20
c-state4/ C2E 40
c-state5/ C3 20
c-state6/ C4 60
c-state7/ C4E 100
edited Aug 7 at 0:04
answered Jul 12 at 20:23
vallismortis
4
That is nice detective work! I had forgotten how complex the C2D/C2Q C-states were. Re untapped power savings, if your firmware is good enough then you should still be getting the benefit of at least some of the C-states via acpi_idle and the various performance governors. What states does powertop show on your system?
- Stephen Kitt
Jul 12 at 21:07
1
Very nice information, have you considered proposing your patch to the upstream Linux kernel?
- Lekensteyn
Jul 13 at 9:10
1
"The C1E state appears to be used in lieu of C1..." Which state is used - as shown by powertop - is determined solely by the kernel, therefore I believe it will not "have something to do with the MWAIT flags", it will be chosen solely based on the order of the states and the exit_latency and target_residency. That said, I would be slightly concerned about leaving states in the table if they didn't seem to get used when tested... in case those states didn't actually work as expected, and there was some other workload pattern that led to them being used & the unexpected behaviour happening.
- sourcejedi
Jul 15 at 14:13
1
"the transitions are C0 - C1 - C0 - C1E - C0" - I don't think that's a good description of that slide. From the kernel / powertop point of view, all transitions are either from C0 or to C0. If you're not in C0, you're not running any instructions, therefore the kernel cannot either observe or request any transition between states on that cpu :-). And as you say, the kernel "menu" governor may well e.g. jump straight into C1E, without spending any time in C1 first.
- sourcejedi
Jul 15 at 14:29
1
"just taking educated guesses based on the Nehalem idle values seems to work fine" - note this is not a good way to get your patch accepted upstream :-P, in that the exit latency must not be an underestimate, otherwise I think you will violate PM_QOS_CPU_DMA_LATENCY, which may be set by drivers (or userspace?)
- sourcejedi
Jul 15 at 14:41
up vote
7
down vote
Is there a more appropriate way to configure a kernel for optimal CPU idle support for this family of processors (aside from disabling support for intel_idle)
You have ACPI enabled, and you've checked that acpi_idle is in use. I sincerely doubt you have missed any helpful kernel config option. You can always check powertop for possible suggestions, but probably you already knew that.
This is not an answer, but I want to format it :-(.
Looking at the kernel source code, the current intel_idle driver contains a test to specifically exclude Intel family 6 from the driver.
No it doesn't :-).
id = x86_match_cpu(intel_idle_ids);
if (!id) {
	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
	    boot_cpu_data.x86 == 6)
		pr_debug(PREFIX "does not run on family %d model %d\n",
			 boot_cpu_data.x86, boot_cpu_data.x86_model);
	return -ENODEV;
}
The if statement does not exclude Family 6. Instead, the if statement provides a message when debugging is enabled, that this specific modern Intel CPU is not supported by intel_idle. In fact, my current i5-5300U CPU is Family 6 and it uses intel_idle.
What excludes your CPU is that there is no match in the intel_idle_ids table.
I noticed this commit which implemented the table. The code it removes had a switch statement instead. This makes it easy to see that the earliest model intel_idle has been implemented/successfully tested/whatever is 0x1A = 26. https://github.com/torvalds/linux/commit/b66b8b9a4a79087dde1b358a016e5c8739ccf186
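The table-driven check can be illustrated with a small userspace sketch. The struct, table and function below are hypothetical stand-ins (not the kernel's struct x86_cpu_id or x86_match_cpu()), and the table holds only an illustrative subset of models; the real intel_idle_ids table is longer.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for an entry in a CPU match table. */
struct cpu_id {
	unsigned int family;
	unsigned int model;
	const char *name;
};

/* Illustrative subset only; model 0x17 (Core 2 Penryn) is absent. */
static const struct cpu_id supported_ids[] = {
	{ 6, 0x1A, "Nehalem-EP" },
	{ 6, 0x1E, "Nehalem" },
	{ 6, 0x2A, "Sandy Bridge" },
};

/* Return the matching table entry, or NULL when the CPU is not listed. */
static const struct cpu_id *match_cpu(unsigned int family, unsigned int model)
{
	size_t count = sizeof(supported_ids) / sizeof(supported_ids[0]);

	for (size_t i = 0; i < count; i++) {
		if (supported_ids[i].family == family &&
		    supported_ids[i].model == model)
			return &supported_ids[i];
	}
	return NULL;
}
```

Here match_cpu(6, 0x17) returns NULL even though the family is 6, which corresponds to the driver bailing out with -ENODEV; the family test only decides whether the debug message is printed on the way out.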
edited Jul 12 at 14:09
answered Jul 12 at 13:52
sourcejedi
Thank you for noting the specific test is based on the intel_idle_ids table - I've adjusted the phrasing of the question, which still stands regarding Core 2/Yorkfield support.
- vallismortis
Jul 12 at 15:06
This article provides additional background and usage information for the PowerTOP command.
- vallismortis
Jul 12 at 20:55
up vote
6
down vote
I suspect this could just be a case of opportunity and cost. When intel_idle was added, it seems Core 2 Duo support was planned, but it never was fully implemented; perhaps by the time the Intel engineers got round to it, it wasn't worth it any more. The equation is relatively complex: intel_idle needs to provide sufficient benefits over acpi_idle to make it worth supporting here, on CPUs which will see the "improved" kernel in sufficient numbers...
As sourcejedi's answer says, the driver doesn't exclude all of family 6. The intel_idle initialisation checks for CPUs in a list of CPU models, covering basically all micro-architectures from Nehalem to Kaby Lake. Yorkfield is older than that (and significantly different: Nehalem is very different from the architectures which came before it). The family 6 test only affects whether the error message is printed: the message is displayed on Intel CPUs, not AMD CPUs (Intel family 6 includes all non-NetBurst Intel CPUs since the Pentium Pro).
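A minimal sketch of that control flow, assuming a hypothetical model_is_listed() predicate in place of the real table lookup (it lists only three models for illustration) and made-up vendor constants; fprintf stands in for the kernel's pr_debug. The point is that the -ENODEV return is reached for every unlisted CPU, and only the message depends on the vendor and family test.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's vendor constants. */
enum vendor { VENDOR_INTEL, VENDOR_AMD };

/* Hypothetical predicate replacing the table lookup; illustrative models only. */
static bool model_is_listed(enum vendor v, unsigned int family, unsigned int model)
{
    return v == VENDOR_INTEL && family == 6 &&
           (model == 0x1A || model == 0x2A || model == 0x3D);
}

/*
 * Returns 0 when the CPU is listed, -ENODEV otherwise.
 * *printed records whether the diagnostic line was emitted.
 */
static int probe_sketch(enum vendor v, unsigned int family,
                        unsigned int model, bool *printed)
{
    *printed = false;
    if (!model_is_listed(v, family, model)) {
        /* Only the message is gated on Intel family 6. */
        if (v == VENDOR_INTEL && family == 6) {
            fprintf(stderr, "does not run on family %u model %u\n",
                    family, model);
            *printed = true;
        }
        return -ENODEV; /* every unlisted CPU ends up here */
    }
    return 0;
}
```

Calling probe_sketch(VENDOR_INTEL, 6, 23, &printed) returns -ENODEV and sets printed to true; an unlisted AMD CPU also gets -ENODEV, but with no message.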
To answer your configuration question, you could completely disable intel_idle, but leaving it in is fine too (as long as you don't mind the warning).
The pr_debug() message should only appear if you do something very specific to enable that debug message, so you don't even have to ignore the warning.
– sourcejedi
Jul 12 at 14:01
2
@sourcejedi I mentioned that because the OP is seeing it.
– Stephen Kitt
Jul 12 at 14:02
gotcha. I present a half-serious comment: since we are asked about a sensible kernel config, if it is used day-to-day, maybe don't use the option that enables all debug messages? With the right option, they can be enabled dynamically and selectively when necessary. kernel.org/doc/html/v4.17/admin-guide/dynamic-debug-howto.html If you enable all debug messages, you probably have lots of messages you are ignoring anyway :).
– sourcejedi
Jul 12 at 14:07
@sourcejedi I fail to see the relevance of your comments regarding disabling kernel messages. I don't see this as being constructive to the question, which specifically addresses Core 2 support for the intel_idle driver.
– vallismortis
Jul 12 at 15:14
@vallismortis it is very tangential. It means that there is a valid configuration you can use for Core 2 and above, which does not print this as an annoying warning message which must simply be ignored, and will use intel_idle if supported... but then I suppose you would use dynamically loaded modules anyway, so maybe not worth mentioning.
– sourcejedi
Jul 12 at 15:19
answered Jul 12 at 13:57
Stephen Kitt
139k22296359