Overheated chips can have adverse effects on packaging costs, processor power and chip reliability.
On-chip hot spots are an important consideration in current- and future-generation microprocessor design because they can adversely affect packaging costs, processor power, cooling power and even chip reliability. The emerging trend toward multi-core, multi-threaded microprocessor architectures, however, offers an opportunity to mitigate hot-spot problems through thermal-aware scheduling of workloads.
In our research, we investigated the
trade-offs between temporal and spatial hot-spot mitigation schemes and thermal
time constants, workload variations and microprocessor power distributions. Two
observations motivated our work:
1. In view of advanced, fine-grain clock-gating technology prevalent in modern processors, on-chip temperature hot spots and power profiles are closely linked to unit-level utilization.
2. The rise and fall times of on-chip temperatures are typically in the hundreds of milliseconds range. This is at least an order of magnitude larger than Operating System (OS) scheduler ticks, which are in the range of milliseconds.
By leveraging spatial and temporal heat
slack, our schemes help lower on-chip unit temperatures by adjusting the
workload in a timely manner, using the OS and the existing on-chip digital
thermal sensors.
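To make the second observation concrete, the sketch below models a unit's temperature as a first-order response with a single time constant. The 200 ms time constant and the 60 and 80 degree temperatures are illustrative assumptions, not measured values; only the 4 ms tick comes from the experimental setup described in this article. The function names are hypothetical.

```python
import math

TICK_MS = 4.0    # scheduler tick used in the experiments
TAU_MS = 200.0   # assumed thermal time constant ("hundreds of milliseconds")


def temperature_after(t_ms, start_c, steady_c, tau_ms=TAU_MS):
    """First-order thermal response moving from start_c toward steady_c."""
    return steady_c + (start_c - steady_c) * math.exp(-t_ms / tau_ms)


def ticks_until(threshold_c, start_c, steady_c, tau_ms=TAU_MS, tick_ms=TICK_MS):
    """Count scheduler ticks before a heating unit first reaches threshold_c."""
    if not start_c < threshold_c < steady_c:
        raise ValueError("threshold must lie between start and steady state")
    ticks = 0
    while temperature_after(ticks * tick_ms, start_c, steady_c, tau_ms) < threshold_c:
        ticks += 1
    return ticks


# Number of scheduler ticks that fit into one thermal time constant.
ticks_per_tau = TAU_MS / TICK_MS

# Temperature rise during the first tick when heating from 60 toward 80.
rise_first_tick = temperature_after(TICK_MS, 60.0, 80.0) - 60.0
```

Under these assumptions the scheduler gets dozens of ticks to react before a unit covers half of its temperature swing, which is the slack the schemes rely on.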
Fig. 1. SIMP experimental setup: real-time thermal mapping of the processor. (Left) Infra-red imaging. (Right) Experimental setup with IR camera.
We performed our experiments on a live
POWER system running Bare Metal Linux (BML). To measure the changes in on-chip
temperatures, we modified BML to sample the on-chip thermal sensors at every
scheduler tick, at a granularity of four milliseconds (ms) per sample. We
calibrated the thermal sensors using the Spatially-resolved Imaging of
Microprocessor Power (SIMP) methodology developed at IBM Research by Hamann et
al. To make the infra-red imaging possible, we replaced the metal heat sink of
the POWER system with a transparent liquid heat sink, as illustrated in Figure
1.
Using this experimental setup, we
demonstrated potential temperature savings with a static workload
scheduler.
Using hot spot mitigation techniques without creating more complexity
To show that a modern OS such as Linux
can implement hot spot mitigation techniques without additional complexity, we
developed a prototype thermal-aware scheduler extension to the Linux kernel
(linux-2.6.17). Figure 2 shows the three schemes we prototyped in Linux:
heat-balancing, deferred execution, and reduced threading with cool-loop
augmentation. These three techniques have different triggering conditions and
response times. The heat-balancing technique, for example, has a slower
response time, but it has less overhead. Therefore, we configured the system to
trigger heat-balancing at lower temperatures. If heat-balancing cannot stop
the rise in on-chip temperatures, the next technique, deferred execution, is
triggered.
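The escalation between techniques can be sketched as a simple threshold lookup. The trigger temperatures below are hypothetical, since the article gives no values, and placing the cool loop last is an assumption based on its role as the fallback when the other two techniques fail.

```python
# Hypothetical trigger temperatures in degrees Celsius, most aggressive first.
TRIGGERS = [
    (85.0, "cool_loop"),
    (80.0, "deferred_execution"),
    (75.0, "heat_balancing"),
]


def select_technique(core_temp_c):
    """Return the most aggressive technique whose trigger has been reached."""
    for threshold_c, technique in TRIGGERS:
        if core_temp_c >= threshold_c:
            return technique
    return None  # core is cool enough; no mitigation needed
```

The low-overhead technique gets the lowest trigger so that it has time to act despite its slower response.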
Fig. 2. Thermal-aware scheduler prototyped in Bare Metal Linux (kernel 2.6.17).
To implement heat-balancing, we
augmented the load-balancing code in the Linux scheduler so that it takes the
thermal characteristics of each task and core into account when making
task-migration decisions. When the system is overloaded, the heat-balancing
extension attempts to assign hot and cold tasks to each core to create
opportunities for leveraging temporal heat slack. When the system has fewer
tasks than the number of cores, the heat-balancing routine moves a hot task to
a colder, idle core, thereby creating opportunities to leverage spatial heat
slacks.
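The two heat-balancing cases can be sketched as follows. This is an illustrative model rather than the kernel code: the per-task heat scores, the serpentine dealing order, and both function names are assumptions made for the example.

```python
def balance_overloaded(task_heat, n_cores):
    """Spread tasks so that each core receives a mix of hot and cold tasks.

    task_heat maps a task name to a hypothetical heat score (higher is
    hotter). Tasks are sorted hottest first and dealt out in serpentine
    order, so the hottest tasks end up sharing cores with the coldest.
    """
    ordered = sorted(task_heat, key=task_heat.get, reverse=True)
    cores = [[] for _ in range(n_cores)]
    for i, task in enumerate(ordered):
        lap, pos = divmod(i, n_cores)
        core = pos if lap % 2 == 0 else n_cores - 1 - pos
        cores[core].append(task)
    return cores


def pick_migration_target(core_temp_c, idle_cores, current_core):
    """Return the coldest idle core if it is cooler than the current core."""
    if not idle_cores:
        return current_core
    coldest = min(idle_cores, key=lambda c: core_temp_c[c])
    if core_temp_c[coldest] < core_temp_c[current_core]:
        return coldest
    return current_core
```

The first function targets temporal heat slack (hot and cold tasks alternate on one core); the second targets spatial heat slack (a hot task moves to a cooler part of the chip).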
When heat-balancing is not triggered
in time to prevent a rise in temperatures, the deferred execution scheme
temporarily suspends the hot task to allow other, colder tasks to run.
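A minimal sketch of deferred execution, assuming each task carries a hypothetical heat score and the scheduler compares the core temperature against a hypothetical deferral threshold (neither value is specified in the article):

```python
def pick_next_task(run_queue, task_heat, core_temp_c, defer_above_c, hot_score):
    """Pick the next task, skipping hot tasks while the core is too warm.

    run_queue lists task names in normal scheduling order. A task counts
    as hot when its heat score is at least hot_score.
    """
    if not run_queue:
        return None
    if core_temp_c < defer_above_c:
        return run_queue[0]  # cool enough: normal scheduling order
    for task in run_queue:
        if task_heat[task] < hot_score:
            return task  # a colder task runs while hot ones stay deferred
    # Nothing colder is runnable; escalation is outside this sketch.
    return run_queue[0]
```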
To cover cases where neither
heat-balancing nor deferred execution could reduce on-chip hot spots, we
implemented a new kernel task called Cool Loop to create temporal heat
slack. The scheduler can decrease temperatures either by reducing threading
(running cool loops on both hardware threads in a time-interleaved manner so
that only one hardware thread is active during the cooling period) or by
suspending all threading (running cool loops on both hardware threads at
the same time during the cooling period).
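The two cooling modes can be sketched as per-tick assignments for a core with two hardware threads. Alternating the cool loop on every tick is an assumption; the article does not state the interleaving granularity, and the mode names are hypothetical.

```python
def cooling_schedule(mode, n_ticks):
    """Return per-tick assignments for two hardware threads during cooling.

    Each entry is a (thread0, thread1) pair whose values are "user" or "cool".
    """
    schedule = []
    for tick in range(n_ticks):
        if mode == "reduce_threading":
            # Alternate the cool loop between threads: one thread stays active.
            pair = ("cool", "user") if tick % 2 == 0 else ("user", "cool")
        elif mode == "suspend_threading":
            # Both threads run the cool loop: no user work during cooling.
            pair = ("cool", "cool")
        else:
            raise ValueError(f"unknown mode: {mode}")
        schedule.append(pair)
    return schedule
```

Reduced threading keeps user work progressing at roughly half the thread-level parallelism, while suspending all threading trades throughput for the fastest cooling.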
When the cool loop is running, it does
not perform useful computation. It has an OS priority, though, that is higher
than user tasks but lower than interrupt service routines and scheduler ticks.
Allowing interrupt service routines to run does not pose an additional risk of
further heating because interrupt routines, including scheduler ticks, are
typically short. After triggering cool loops, the scheduler checks the core
temperature at every tick and resumes user tasks when the core temperature
drops to a preset lower temperature.
System Software Solutions
Current power-aware microarchitecture
R&D focuses mainly on power reduction (at minimal performance cost). For
reducing cooling cost and improving system reliability, however, hot-spot
mitigation is more important. Moreover, the need is increasingly shifting
toward thermal mitigation solutions that involve system software.
Hardware-only solutions imply additional verification cost and
power/performance overhead, which are becoming hard to justify.
Our current work, therefore, is a step
in the right direction for modern temperature-aware system design. Lowering the
average temperature by three-to-five degrees can reduce leakage power by about
five percent in current technologies. The exact reduction amount depends on the
base thermal operating point and the particular CMOS technology node. Such a
temperature reduction also increases the mean time to failure.
Quantification of the reliability benefit depends on the exact chip design
details, but a three-to-five degree reduction in temperature is significant. In
view of more aggressive on-chip power management features, the temperature
reduction in future systems will be even greater.
Although we performed our experiments at
the OS level to try to quantify real savings from temperature-aware scheduling,
we believe that the system hypervisor and workload manager can also implement
these schemes. We have started discussions with OS and hypervisor groups within
IBM Research and the Systems & Technology Group (STG), in conjunction with the
system-level (hardware) power management group at the IBM Research Lab in Austin, where we will investigate the technical feasibility of applying this technique to
future IBM server systems.

