Wake-Up Latencies For Processor Idle States On Current x86 Processors

Comput Sci Res Dev (2015) 30:219–227
DOI 10.1007/s00450-014-0270-z
SPECIAL ISSUE PAPER
Robert Schöne · Daniel Molka · Michael Werner
Published online: 5 July 2014

© Springer-Verlag Berlin Heidelberg 2014
Abstract During the last decades various low-power states have been implemented in processors. They can be used by the operating system to reduce the power consumption. The applied power saving mechanisms include load-dependent frequency and voltage scaling as well as the temporary deactivation of unused components. These techniques reduce the power consumption and thereby enable energy efficiency improvements if the system is not used to full capacity. However, an inappropriate usage of low-power states can significantly degrade the performance. The time required to re-establish full performance can be significant. Therefore, deep idle states are occasionally disabled, especially if applications have real-time requirements. In this paper, we describe how low-power states are implemented in current x86 processors. We then measure the wake-up latencies of various low-power states that occur when a processor core is reactivated. Finally, we compare our results to the vendors' specifications that are exposed to the operating system.

Keywords Power management · Idle-times · C-state · Idle-state · Performance measurement

R. Schöne (B) · D. Molka · M. Werner
Center for Information Services and High Performance Computing (ZIH), Technische Universität Dresden, 01062 Dresden, Germany
e-mail: robert.schoene@tu-dresden.de
D. Molka
e-mail: daniel.molka@tu-dresden.de
M. Werner
e-mail: michael.werner3@tu-dresden.de

1 Introduction

Energy efficiency is one of the major design goals in the development of computing systems. This is for example reflected in the battery life of laptops that increases with each generation while the performance is typically improved as well. That evolution is facilitated by several low-power states that reduce the power consumption if the full performance is not required. Many power saving technologies that were originally developed for mobile devices have already been adopted by processors used in HPC systems. Therefore, these mechanisms can be used to improve the efficiency of scientific applications as well.

The power management is typically administered by the operating system. However, developments in the last decade (e.g., [7,9,10,22,23,27]) showed that direct control of these features facilitates significant energy-efficiency improvements. Therefore, the configuration of low-power states is partially exposed to the user by contemporary operating systems, e.g., via the cpufreq and the cpuidle subsystem in the Linux kernel. This enables explicit control of certain features, e.g., dynamic voltage and frequency scaling (DVFS). However, the usage of these states has a significant influence on the system performance, thus they have to be used conservatively if performance is not to be hurt.

Information about the overhead of using low-power states is essential to use them efficiently. In this paper we examine the delay that occurs when processors in low-power states are reactivated. The results can be used for example to decide which low-power state to use for individual collective operations in parallel programs. Another possible use case of our data is to estimate if the usage of a certain low-power state would violate given real-time requirements.

The remainder of this paper is structured as follows: in Sect. 2 we discuss related work regarding the functionality and utilization of low-power states. In Sect. 3 we detail how current x86 processors implement low-power states. We introduce our methodology in Sect. 4 and present the results
for several test systems in Sect. 5. We close this paper with a summary and an outlook in Sect. 6.

2 Background and related work

The power consumption of processors consists of the dynamic switching power and the static power that is, among other things, caused by leakage currents. The power can be approximated using equation (1) where $V$ is the supply voltage, $I_{\mathrm{static}}$ is the static current, $C$ is the load capacitance, $f$ is the clock frequency, and $\alpha$ is the activity factor, i.e., the fraction of transistors that switch each cycle on average [5].

\[ P = \alpha C V^{2} f + I_{\mathrm{static}} V \tag{1} \]

There are multiple techniques to reduce the power consumption. DVFS [6] trades performance for power reduction by reducing the frequency. This directly reduces the dynamic power. Furthermore, a lower frequency requires a lower voltage which enables additional savings for both the dynamic and static power. Clock gating removes the clock signal from a circuit. Thus, the frequency $f$ and consequently the dynamic power of the clock gated section are effectively zero. Furthermore, clock gating enables significant savings in the clock distribution network itself [28]. However, static power is not affected by clock gating. Power gating [26] cuts off the supply voltage. Therefore, dynamic as well as static power are virtually zero.

The advanced configuration and power interface (ACPI) specification [1] defines several low power states for processors, devices and computing systems. This specification is used to enable operating system-directed configuration and power management (OSPM). OSPM requires the operating system to control the power management capabilities of system components depending on the capacity utilization of the system. In ACPI the following processor states are defined:

• Processor Performance states (P-states), that define different performance levels for operating processors and
• Processor Power states (C-states), that specify power saving states for idle processors.

P0 is the P-state with the highest performance level and power consumption. Higher P-states provide lower performance and reduce the power consumption. They are typically implemented with DVFS. By default, P-states are used by operating systems when the load on a processor core is <100 %. Linux also allows manual control of the processors' P-states. This can for example be used to throttle the frequency of fully occupied processors when an application seems to be memory bound [7,23]. However, Schöne et al. [24] showed that the memory bandwidth can also be reduced when reducing the processor clock rate.

C0 is the active state in which the processor is executing instructions at the performance level defined by the P-state. In higher C-states the processing of instructions is paused and the power consumption is reduced. Since the processors are not operational, more aggressive power saving techniques such as clock and power gating of whole cores can be used. However, this can also lead to a high latency to re-enable the processor. Therefore, C-states are used by the operating system only when it expects a certain timespan to elapse until the next task becomes ready to be executed. Unlike P-states, C-states cannot be used directly by applications. However, they can be used indirectly by eliminating load on some processor cores, e.g. via dynamic concurrency throttling (DCT) [8] and idle waiting policies. The ACPI standard defines four different C-states:

• C0: the processor is ready to execute work
• C1: (halt state) the processor is paused, but can switch to C0 immediately
• C2: the processor can switch to C0 only with a certain delay, however it still responds to cache coherence traffic.
• C3: the processor can switch to C0 with a significant delay and does not respond to cache coherence traffic.

While the usage of P- and C-states reduces power consumption, it also reduces performance. The switch to a new P-state for example takes a certain time span before it is realized in hardware. The delay is mainly caused by the voltage regulators that require some time to adjust the voltage as well as the frequency multipliers¹ that need to adapt to the new frequency. Mazouz et al. [20] analyzed the transition latency of P-state changes. Similarly, re-enabling a processor core after using an idle state also causes a certain delay. The ACPI standard defines interfaces to apprise the operating system of these latencies. However, the interface is limited. E.g., the wake-up times for C-states are not defined for every available frequency, and the P-state latencies are not defined for every pair of frequencies. Also the concept of package C-states is not known in the ACPI standard.

¹ The variable core clock is typically derived from a constant reference clock via frequency multiplying circuits, e.g., phase-locked loops (PLL).

The usage of ACPI states can be monitored in software (e.g., via Linux kernel tracepoints) and hardware (e.g., via residency counters). Different performance measurement tools implement such monitors [4,25] and help users to find optimizable routines, e.g., barriers that are implemented using busy waiting.

Knobloch et al. [19] extended Scalasca to highlight blocking MPI communication that spends waiting times in C0, thus being not energy-efficient. They suggest the usage of low power states as an energy-efficient alternative. However,
to determine an appropriate C-state, accurate performance numbers are needed to determine the runtime overhead. The same tagging of inefficient wait states can be performed with Vampir [25] and Paraver/Extrae [4].

Real-time systems require a certain responsiveness to fulfill the real-time limits given by applications. To avoid the transition latency when leaving an idle state, power management capabilities are often disabled [2]. Exact numbers on the overhead would help administrators to choose a more energy efficient setup.

3 C-states in current x86 processors

Current x86 processors support idle states beyond the ACPI specification. In this section we describe the C-states that are available on our test systems (see Table 2). The information is based on datasheets [3,12,13,15,18] provided by the vendors and the Linux cpuidle subsystem. The cpuidle subsystem of Linux establishes sysfs entries that describe the available C-states. These entries declare for example the name of the idle state, the presumed transition latency, and the usage of the C-state, and they provide an option to disable certain C-states.

The operating system can independently configure the C-state of every logical CPU. However, the granularity supported by the hardware can be different. If Hyperthreading is enabled, a processor core is in the lowest C-state requested for any of the corresponding CPUs (e.g., [17]). Similarly, the dual-core compute units in AMD family 15h processors [3] only enter deep C-states if both cores are configured accordingly.

Additionally, many multicore processors implement package C-states that can be entered when all cores within the processor are in a certain C-state. Package C-states enable additional power savings by partially disabling shared resources, e.g., the last level cache. We denote core C-states as CC<id> and package C-states as PC<id>, respectively. The package C-state is the minimum of all core C-states within a package, e.g., PC3 can only be entered when all cores are in CC3 or higher. Likewise, if one or more cores are active, i.e., in CC0, the package is in PC0. The package C-state can also be limited by an integrated GPU (e.g., [15, Sect. 4.2.6]).

3.1 C-state characteristics in Intel processors

Intel processors that are based on the Westmere, Sandy Bridge, Ivy Bridge, or Haswell microarchitecture implement at least four core C-states, including CC0, CC1, CC3, and CC6 [13–17]. Some models implement CC7 as well.

When a processor core enters the CC1 state, the cache entries and the architectural state are preserved. The processor core also continues to process snoop requests in order to maintain cache coherence. If all cores in a processor are in the CC1 state the complementary C1E state is enabled. In that case the voltage for the whole package is reduced to enhance the power savings. When a processor core enters the CC3 state, all cache lines are flushed from L1 and L2 caches to the shared last level cache. The core keeps its architectural state and is clock gated. When the CC6 state is entered, the architectural state is flushed as well. It is stored in a special SRAM in the Uncore. Afterwards, the core is power gated. The CC7 state exhibits the same behavior as the CC6 state [15], but is used to determine the depth of the package C-state.

The processor enters the PC3 state if all of its cores are in CC3. The L3 cache is still "snoopable" [13] or "active" [15] in the PC3 state, but memory is put into self-refresh. When all cores are in CC6, the package enters the PC6 state. In that state the L3 cache retains its content but the data is not accessible. Presumably clock gating is used in the control logic and/or data paths in that case. Processors based on Sandy Bridge, Ivy Bridge or Haswell microarchitectures implement a special package C2 state to answer requests from devices and other processors that cannot be processed in deeper package C-states. If a processor receives snoop requests, or if accesses to its portion of DRAM occur while it is in PC3 or PC6, the processor transitions to the PC2 state to process the requests. Some recent processors also implement an even deeper package C-state, PC7. In this state some power is removed from portions of the system agent as well [15]. Additionally, the last level cache is flushed and power gated.

Contemporary Intel processors also implement hardware features that allow the processor to dynamically decide which C-state the processor's cores should use. Auto-demotion is used for switching to a lower C-state than requested by the operating system. Auto-promotion allows the processor to use higher C-states. This effectively overrides the decisions made by the operating system. In [11], Intel documents model specific registers (MSRs) for the configuration of these features. The Linux intel_idle driver uses the MSRs to alter the settings previously specified by the BIOS.

3.2 C-state characteristics in AMD processors

AMD family 15h processors support up to three core C-states [3]. Each of them can be configured by writing to PCI registers during the initialization of the processors. The C-state specifications are encoded in three 16 bit registers as described in Table 1.

² The PCI register readings are:
D18F4 0x118: 0x0107000Bh - settings for C1 (15:0) and C2 (31:16)
D18F4 0x11C: 0x00000000h - C3 (15:0) not configured
D18F4 0x128: 0x00005500h - C1 cache flush timer = 28h (11:5)
D18F3 0x0DC: 0x05475632h - C2 cache flush timer = 28h (25:19).

According to the vendor documentation [3] and consistent with PCI register readings on our test system² only two
C-states are configured by default. We will refer to them as CC1 and CC6.

Table 1 AMD family 15h model 00h-0Fh C-state control register [3]

Bit     Setting         Options
0       Direct probe    Frequency used to handle probes
                        0b: use frequency defined by P-state
                        1b: adhere to clock divisor (see 7:5)
1       Cache flush     0b: disabled, 1b: enabled
3:2     Flush timer     Select timer register
                        01b: D18F3 0x0DC, 10b: D18F4 0x128
4       Reserved        n/a
7:5     Clock divisor   000b: disabled, 001b: 2, 010b: 4, 011b: 8, 100b: 16, 101b: 128, 110b: 512, 111b: turn off clock
8       Power gating    0b: disabled, 1b: enabled
15:9    Reserved        n/a

In CC1 (settings: 000Bh) the compute unit does not apply a frequency divisor, i.e. the frequency is defined by the P-state prior to entering the C-state. Consequently, the frequency does not need to be ramped up to answer incoming probe requests. After approximately 407 µs, the L1 and L2 caches are flushed. Furthermore, the cache flush success monitor can be used to record the rate of cache flush timer expirations relative to C-state exits. If the success rate is higher than the configured threshold (D18F4 0x128[20:18]), caches are flushed before the timer expires. However, the feature is disabled (D18F4 0x128[20:18] = 000b) by default. Power gating is not used in CC1. The CC6 state (settings: 0107h) uses the same frequency settings. However, it also applies power gating. Therefore, the operating frequency is only relevant until caches are flushed and the compute unit is powered down completely. CC6 uses a separate timer register for the cache flush, but the specified timeout equals the CC1 setting. The cache flush success monitor is not supported by the selected timer register. On our system, the processor states of all cores that go to CC6 are stored in the DRAM of one NUMA node, which influences the wake-up latency.

AMD family 15h processors model 10h-1Fh introduce package level power savings, namely:

• DRAM self-refresh
• northbridge clock and power gating
• package power off (PC6)

Flags for the new features are added to the C-state control registers, thus they can be configured independently for every C-state. DRAM self-refresh and northbridge clock gating can be used together with cores in CC1. Power gating the northbridge or powering down the whole package (PC6) requires all cores to be in CC6.

4 Methodology

To determine the wake-up times for processors, we patched the Linux kernel (3.13) such that it provides additional sysfs entries for each CPU in /sys/devices/system/cpu/cpu*/cpuidle/measure. These entries are used to trigger the wake-up of the corresponding core, store the results of a single measurement, and constitute the interface between user- and kernel-space.

The patch also alters the Linux kernel in the following way: Whenever a CPU issues the function cpuidle_idle_call and the measurement flag is set, this CPU updates a per-CPU data structure. This data structure consists of the following attributes:

• measure: flag to enable measurement
• init_tmstmp: timestamp when the measurement was initialized
• c_state: C-state left by the wake-up call
• result: time between initiation and completion of the wake-up call

In the following we will refer to the processor core that initiates the wake-up sequence as caller and the processor core that is woken up as callee. When starting a measurement, the caller:

1. resets the callee's result to zero
2. sets the callee's measurement flag
3. takes the "before" timestamp using the TSC
4. calls the function wake_up_nohz_cpu

When the callee wakes up, it checks its result field and its measurement flag. If they are set to 0 and 1, respectively, the callee:

1. stores the difference between the current TSC value and the "before" timestamp in the result field
2. stores the information about the C-state it just left in the c_state field.

We use a Python script to take measurements for every C-state and frequency combination available on the system. The Python script utilizes powertop to activate all power saving options to avoid unwanted wake-up calls, e.g., from USB devices that are not suspended. It uses the additional sysfs entries to initiate the measurements and collect the results. The C-state of idle cores is controlled by disabling all higher C-states via sysfs entries.³ Furthermore, the time between the measurements is chosen high enough to reach the highest available C-state.

³ /sys/devices/system/cpu/cpu*/cpuidle/state*/disable.
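The register layout of Table 1 can be checked against the settings quoted in Sect. 3.2 (000Bh for CC1, 0107h for CC6) and the PCI register readings given in the footnote there. The following Python sketch decodes these values. The function and dictionary names are hypothetical and exist only for this illustration, and the bit positions are taken from Table 1 and the footnote, not from an independent reading of the vendor manual.

```python
# Clock divisor encoding for bits 7:5, as listed in Table 1.
CLOCK_DIVISORS = {0: "disabled", 1: 2, 2: 4, 3: 8, 4: 16, 5: 128, 6: 512, 7: "clock off"}
# Flush timer register selection for bits 3:2, as listed in Table 1.
FLUSH_TIMER_REGISTERS = {0b01: "D18F3 0x0DC", 0b10: "D18F4 0x128"}


def bitfield(value, high, low):
    """Extract bits high..low (inclusive) from an integer."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def decode_cstate_control(value):
    """Decode one 16-bit C-state control field according to Table 1."""
    return {
        "direct_probe_uses_divisor": bool(bitfield(value, 0, 0)),
        "cache_flush": bool(bitfield(value, 1, 1)),
        # None if the two bits hold an encoding that Table 1 does not list.
        "flush_timer_register": FLUSH_TIMER_REGISTERS.get(bitfield(value, 3, 2)),
        "clock_divisor": CLOCK_DIVISORS[bitfield(value, 7, 5)],
        "power_gating": bool(bitfield(value, 8, 8)),
    }


def split_d18f4x118(reg):
    """Split the 32-bit D18F4 0x118 reading into its (15:0) and (31:16) fields."""
    return bitfield(reg, 15, 0), bitfield(reg, 31, 16)


if __name__ == "__main__":
    first, second = split_d18f4x118(0x0107000B)
    print(hex(first), decode_cstate_control(first))
    print(hex(second), decode_cstate_control(second))
```

For the reading 0x0107000B the two fields are 000Bh and 0107h. Both enable the cache flush and leave the clock divisor disabled, they select different flush timer registers, and only 0107h sets the power gating bit, which matches the description of CC1 and CC6 in the text.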
Fig. 1 Measurements: (a) Local measurement, (b) Remote Idle, (c) Remote Active, (d) Legend

The script collects the wake-up data from a Local measurement. In this case the caller and the callee belong to the same package. On multi socket systems the script also collects wake-up data from the other packages. In this case the callee is located in a different package than the caller. The package C-state is not controllable by software, thus we make two distinct measurements on other packages. They are called Remote Idle and Remote Active. In these measurements the caller and callee are the same. The only difference is that on Remote Active we do busy waiting on an additional core on the package of the callee. Thus, the package is unable to enter a package C-state in that case. All different measurement possibilities for a two socket system are depicted in Fig. 1. The AMD test system implements two dies per processor. As the two dies within one processor package share some resources, they might also influence each other's idle behavior. We therefore measure the remote wake-up latencies for a core on the second die in the package (Near Remote) as well as a core in another package (Far Remote).

We measure each combination of C-state and P-state at least 400 times and filter results where the operating system requested a C-state that differs from the highest possible one. This occurred for less than one percent of the samples.

Idle power consumption is dominated by non-processor components (e.g., network interfaces, accelerator cards, or hard drives). Thus, the optimal C-state is highly system specific. If these components have a high power consumption, a faster wake-up time can be more beneficial than a small saving in power consumption. Vice versa, if they use only a small amount of energy, a higher C-state can be more favorable.

We measured the power consumption of our Sandy Bridge EP test system with a ZES ZIMMER LMG 450 watt meter that we attached to the power supply. We measured the power savings relative to the C0 state where cores run a while(1) loop. For P-state P0, the power intake of each core is reduced by 3 W in CC1, 4 W in CC3, and 5 W in CC6 as well as CC7. Based on that we calculated the hypothetical idle power for all cores in a certain core C-state without using the corresponding package C-states. The voltage and frequency reduction in C1E saves 20 W per socket compared to all cores in CC1 without C1E. The package C-states PC3 and PC6 decreased the power consumption by 18 W and 14 W per socket, respectively, in addition to the CC3 and CC6 savings.

Table 2 Hardware configuration

Vendor              Intel               Intel                  AMD
Processor           Xeon X5670          Xeon E5-2670           Opteron 6274
Codename            Westmere-EP         Sandy Bridge-EP        Bulldozer
Cores               2x 6                2x 8                   4x 16
Base clock          2.933 GHz           2.6 GHz                2.2 GHz
Max. turbo clock    3.333 GHz           3.3 GHz                3.1 GHz
Uncore/NB clock     2.666 GHz           (Core clock)           2.0 GHz
C-States            C1, C1E, C3, C6     C1, C1E, C3, C6, C7    CC1, CC6
Package C-States    PC1E, PC3, PC6      PC1E, PC3, PC6         n/a

5 Results

In this section we present the wake-up latencies for various C-states on a selection of multi-socket servers. The test systems are detailed in Table 2. To create an equal testbed we used Ubuntu 13.10 with our patched Linux kernel on all systems. We took care that Auto-Demotion and Auto-Undemotion had been disabled on Intel processors before measuring the wake-up times by disabling the respective bits in the MSRs.

In Fig. 2 we show that the transition latency for CC1 states highly depends on the processor frequency that is applied. In our measurement setup, all processor cores share the same
frequency. A lower frequency correlates with a higher transition latency on all test systems. The wake-up time is influenced by an increased message latency between the cores as well as the slower execution of the instructions that are executed during the wake-up procedure. Surprisingly, the older Westmere-EP system shows a noticeable advantage over the Sandy Bridge-EP system. The CC1 wake-up time is much higher on the AMD processors than on both Intel systems. On all test systems, the latency increases when the callee is in a remote processor.

Fig. 2 CC1 (halt) state for different processors: (a) Westmere EP Local, (b) Sandy Bridge EP Local, (c) Sandy Bridge EP Remote Active, (d) Bulldozer Local, (e) Bulldozer Near Remote Active, (f) Bulldozer Far Remote Active

As we described in Sect. 3, the Intel C1E state behaves like the CC1 state, with one exception. When all cores go to CC1, the voltage and frequency are reduced. The effect on the wake-up times is depicted in Fig. 3. The Remote Idle measurements can use the C1E state while the Remote Active measurements avoid C1E by the additional active core in the callee's package. As expected, the wake-up times of both test systems at their respective minimal frequency are not influenced by C1E since the frequency is not reduced further. However, if the system is configured for a higher frequency, the wake-up from C1E takes noticeably longer than a wake-up from CC1 because of the required voltage and frequency ramp. On the Westmere-EP system that overhead is so high that the wake-up latency is higher than at minimum frequency.

Fig. 3 C1E state on Westmere-EP and Sandy Bridge-EP processors: (a) Westmere EP Remote Active, (b) Westmere EP Remote Idle, (c) Sandy Bridge EP Remote Active, (d) Sandy Bridge EP Remote Idle

The Intel CC3 state wake-up times depicted in Fig. 4 are more or less independent of the processor frequency. However, the latency has been reduced significantly from Westmere-EP to Sandy Bridge-EP. The latency is almost identical for cores in another package if the package is not in the package C-state. Still, the PC3 state (Remote Idle) increases the wake-up time by about 15,000 cycles on the Sandy Bridge-EP platform. The CC6 transition latencies of our Intel test systems are depicted in Fig. 5. Compared to the CC3 state, CC6 introduces an additional transition latency. This latency is influenced by the processor frequency if the processor does not go to PC6. However, the improvements of Sandy Bridge with respect to the transition latency that we have seen for CC3 are also present for CC6. We do not depict the CC7 state as it necessarily causes the same latency as CC6.

Fig. 4 C3 states for Westmere-EP and Sandy Bridge-EP processors: (a) Westmere EP Local, (b) Sandy Bridge EP Local, (c) Sandy Bridge EP Remote Active, (d) Sandy Bridge EP Remote Idle (Package C3)

Fig. 5 C6 states for Westmere-EP and Sandy Bridge-EP processors: (a) Westmere EP Local, (b) Sandy Bridge EP Local, (c) Sandy Bridge EP Remote Active, (d) Sandy Bridge EP Remote Idle (Package C6)

The CC6 transition latencies of our AMD test system are depicted in Fig. 6. The frequency has a strong influence. Surprisingly, the lowest frequency has the shortest latency for transitions to the CC0 state if a core in the same package is woken up. This is caused by additional frequency transitions that are required at higher frequencies as the CC6 state reduces the frequency to the one defined by the highest P-state.⁴,⁵ The wake-up latency for Near Remote cases is lower than the Local one. We attribute that to the distance between the callee and the package that holds the C-state information in its DRAM. In our case, the local processor is two HyperTransport hops away from this package, the near-remote processor is only one hop away. As we have shown in [21] this distance has a high influence on the latency and bandwidth between different processors. Thus, it is faster for the near-remote processor to re-establish its processor state.

⁴ D18F3 xA8[31:29] PopDownPstate = D18F3 xDC[10:8] HwPstateMaxVal.
⁵ Each compute unit has its own frequency domain while all CUs share one voltage domain.

Fig. 6 C6 states for Bulldozer processor: (a) Local, (b) Near Remote Idle, (c) Far Remote Active, (d) Far Remote Idle

⁶ The processing of instructions is stopped during the frequency transition as there is no stable clock signal.

The behavior for cores in other packages is identical if there is one active core in the callee's package. In that case the core voltage, which is shared by all compute units, is already at the required level for the requested frequency. Thus, the frequency transition is initiated immediately and contributes to the measured latency.⁶ The situation is different for completely idle processors. In that case all cores
entered the CC6 state. Consequently, the voltage is reduced to the required voltage for the lowest frequency. Therefore, the frequency transition has to be delayed until the voltage has been ramped up to the required level. However, the processing of instructions continues during the voltage ramp using the lower frequency. Therefore, the callee completes its operations before the target frequency is restored. The frequency change is only delayed, so the associated overhead is simply not included in the measurement.

According to the vendor documentation, the CC6 latency should be inversely proportional to the P-state frequency [3]. Obviously, this is not the case. However, the transition latency decreases significantly from 1.6 to 2.2 GHz. We attribute this to the instructions executed by the caller and callee during the measurement as well as the corresponding operations performed by the kernel. The difference between 2.0 and 2.2 GHz is most significant. Since this gap also exists in the Remote Idle case, where the callee's operations are always performed at 1.4 GHz, we attribute it to the caller's side of the measurement. Most likely it is caused by the execution of the function wake_up_nohz_cpu since this is the only complex operation performed by the caller.

In our setup, the Intel Sandy Bridge-EP system reports overestimated transition latencies for CC3 and CC6, which is compliant with the ACPI standard. Still, more accurate information would be beneficial for the operating system to select an appropriate C-state, thus saving energy. On the AMD system, the CC1 state is reported to have a 0 µs transition latency, which is incorrect according to the ACPI standard. However, according to the processor manual [3] the internal C-states do not directly translate to ACPI C-states. The 100 µs reported for the CC6 state are a reasonable estimate for the 78–97 µs we observed.

6 Conclusion and further work

In this paper, we have described a way to determine more accurate wake-up times for core and package C-states. We additionally showed that the wake-up times depend on requested frequency, depth of C-state, depth of package C-state and latency between the processor core that issues the wake-up and the processor core that is woken up. We also compared the measured results with those that are exposed to the operating system. The reported values are not necessarily accurate. Linux reports the ACPI transition latencies via a read-only file. As these latencies can be misleading or even completely wrong, we propose that the corresponding sysfs entries should be writable. Such an interface could be used to make the correct numbers available to the operating system.

The power savings of C-states are substantial, thus they should be used as often as possible. However, the associated overhead, especially for the deep sleep states, is not negligible. The results and methodologies that we have described in this paper can be used to model the behavior of power saving techniques like DCT and idle-waiting, but also to improve the energy efficiency of partially idling computing systems. Furthermore, the exact latencies could also be used to enable more power saving mechanisms in real-time systems without the risk of missing deadlines.

In addition to the latency for leaving an idle state, the latency to enter an idle state would be of interest for power optimizations. This latency could be derived from measuring the TSC before going to idle, the TSC when leaving an idle state and C-state residency counters.

Acknowledgments This work has been funded by the Bundesministerium für Bildung und Forschung via the research projects CoolSilicon (BMBF 16N10186) and Score-E (BMBF 01IH13001).

References

1. Advanced configuration and power interface (ACPI) specification, revision 5.0 (2011). http://www.acpi.info/. Accessed 1 Apr 2014
2. Cherin T, David R, Lana B, Alison Y (ed) (2013) Realtime tuning guide-advanced tuning procedures for the realtime component of Red Hat Enterprise MRG, 4 edn. Red Hat
3. Advanced Micro Devices (2012) BIOS and Kernel developer's guide (BKDG) for AMD Family 15h Models 00h–0Fh Processors. http://support.amd.com/us/Processor_TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf. Rev 3.12, Oct 11, 2012
4. Barreda M, Catalán S, Dolz MF, Fabregat G, Mayo R, Quintana-Ortí ES (2013) Automatic detection of power bottlenecks in parallel scientific applications. Comput Sci Res Dev: 1–9. doi:10.1007/s00450-013-0242-8
5. Butts J, Sohi G (2000) A static power model for architects. In: Microarchitecture, 2000. MICRO-33. Proceedings. 33rd Annual IEEE/ACM International Symposium on, pp 191–201. doi:10.1109/MICRO.2000.898070
6. Choi K, Lee W, Soma R, Pedram M (2004) Dynamic voltage and frequency scaling under a precise energy model considering variable and fixed components of the system power dissipation. In: Computer Aided Design, 2004. ICCAD-2004. IEEE/ACM International Conference on, pp 29–34. doi:10.1109/ICCAD.2004.1382538
7. Curtis-Maury M, Dzierwa J, Antonopoulos CD, Nikolopoulos DS (2006) Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In: Egan GK, Muraoka Y (eds) ICS. ACM, New York, pp 157–166
8. Curtis-Maury M, Singh K, McKee SA, Blagojevic F, Nikolopoulos DS, De Supinski BR, Schulz M (2007) Identifying energy-efficient concurrency levels using machine learning. In: Cluster computing, 2007 IEEE International Conference on. IEEE, pp 488–495
9. Ge R, Feng X, chun Feng W, Cameron K (2007) CPU miser: a performance-directed, run-time system for power-aware clusters. In: Parallel Processing, 2007. ICPP 2007. International Conference on, pp 18–18. doi:10.1109/ICPP.2007.29
10. Hsu Ch, Feng Wc (2005) A power-aware run-time system for high-performance computing. In: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, SC '05. IEEE Computer Society, Washington, DC, p 1. doi:10.1109/SC.2005.3
11. Intel (2014) Intel 64 and IA-32 Architectures Software Developer's Manual vol 3A, 3B, and 3C: System Programming Guide
12. Intel Corporation (2011) 2nd Generation Intel Core Processor Family Desktop, Datasheet, vol 1
13. Intel Corporation (2011) Intel Core i5-600, i3-500 Desktop Processor Series, Intel Pentium Desktop Processor 6000 Series
14. Intel Corporation (2011) Intel Xeon 5600 Series, Datasheet, vol 1
15. Intel Corporation (2014) Desktop 4th Generation Intel Core Processor Family, Desktop Intel Pentium Processor Family, and Desktop Intel Celeron Processor Family, Datasheet vol 1 of 2
16. Intel Corporation (2014) Intel Xeon Processor E5-1600/E5-2600/E5-4600 v2 Product Families, Datasheet, vol 1 of 2
17. Intel Corporation (2012) Intel Xeon Processor E5-1600/E5-2600/E5-4600 Product Families, Datasheet, vol 1
18. Intel Corporation (2013) Desktop 3rd Generation Intel Core Processor Family, Desktop Intel Pentium Processor Family, and Desktop Intel Celeron Processor Family, Datasheet, vol 1 of 2
19. Knobloch M, Mohr B, Minartz T (2012) Determine energy-saving potential in wait-states of large-scale parallel programs. Comput Sci Res Dev 27:255–263. doi:10.1007/s00450-011-0196-7
20. Mazouz A, Laurent A, Pradelle B, Jalby W (2013) Evaluation of CPU frequency transition latency. Comput Sci Res Dev: 1–9. doi:10.1007/s00450-013-0240-x
21. Molka D, Hackenberg D, Schöne R (2014) Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer. In: ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC). doi:10.1145/2618128.2618129
22. Rountree B, Lowenthal DK, de Supinski BR, Schulz M, Freeh VW, Bletsch T (2009) Adagio: making DVS practical for complex HPC applications. In: Proceedings of the 23rd international conference on Supercomputing, ICS '09. ACM, New York, pp 460–469. doi:10.1145/1542275.1542340
23. Schöne R, Hackenberg D (2011) On-line analysis of hardware performance events for workload characterization and processor frequency scaling decisions. In: Proceedings of the second joint WOSP/SIPEW international conference on Performance engineering, ICPE '11. ACM, New York, pp 481–486. doi:10.1145/1958746.1958819
24. Schöne R, Hackenberg D, Molka D (2012) Memory performance at reduced CPU clock speeds: an analysis of current x86_64 processors. In: Proceedings of the 2012 USENIX conference on Power-Aware Computing and Systems, HotPower'12. USENIX Association, Berkeley, p 9. http://dl.acm.org/citation.cfm?id=2387869.2387878. Accessed 1 Apr 2014
25. Schöne R, Tschüter R, Ilsche T, Hackenberg D (2011) The VampirTrace plugin counter interface: introduction and examples. In: Proceedings of the 2010 conference on Parallel processing, Euro-Par 2010. Springer-Verlag, Berlin, Heidelberg, pp 501–511
26. Suji C, Maragatharaj S, Hemima R (2011) Performance analysis of power gating designs in low power VLSI circuits. In: Signal Processing, Communication, Computing and Networking Technologies (ICSCCN), 2011 International Conference on, pp 689–694. doi:10.1109/ICSCCN.2011.6024639
27. Tiwari A, Laurenzano M, Peraza J, Carrington L, Snavely A (2012) Green queue: Customized large-scale clock frequency scaling. In: Cloud and Green Computing (CGC), 2012 Second International Conference on, pp 260–267. doi:10.1109/CGC.2012.62
28. Wu Q, Pedram M, Wu X (2000) Clock-gating and its application to low power design of sequential circuits. Circ Syst Fundam Theory Appl IEEE Trans 47(3):415–420. doi:10.1109/81.841927

Robert Schöne is a research scientist at the Center for Information Services and High Performance Computing at Technische Universität Dresden. He graduated in Computer Science at Technische Universität Dresden, obtaining his diploma in 2006. Robert is now involved in the project Score-E, which focuses on energy efficiency analysis of highly parallel applications. He is interested in processor micro architecture, performance analysis and energy efficiency.

Daniel Molka studied Computer Science at Technische Universität Dresden and received his diploma in 2008. Since then he has been working as a research scientist at the Center for Information Services and High Performance Computing. His current research project HDEEM focuses on improving the energy efficiency of HPC systems. Daniel is mainly interested in processor micro architecture, cache coherence mechanisms, and the energy efficiency thereof.

Michael Werner studies Computer Science at Technische Universität Dresden. Since 2010 he has been a student assistant at the Center for Information Services and High Performance Computing. He is interested in energy efficiency optimizations on the hardware, operating system, and application level. Michael's current research is focused on controlling and monitoring power-saving mechanisms via low-level interfaces.
(a) Local measurement (b) Remote idle (c) Remote active (d) Legend
Fig. 2 CC1 (halt) state for different processors
Fig. 3 C1E state on Westmere-EP and Sandy Bridge-EP processors
Fig. 4 C3 states for Westmere-EP and Sandy Bridge-EP processors
Fig. 5 C6 states for Westmere-EP and Sandy Bridge-EP processors
Fig. 6 C6 states for Bulldozer processor
