IBM Research | Thermal-aware scheduling at the system level
Overheated chips can have adverse effects on packaging costs, processor power and chip reliability.
On-chip hot spots are an important consideration in current- and future-generation microprocessor design because they adversely affect packaging costs, processor power, cooling power and even chip reliability. The emerging trend toward multi-core, multi-threaded microprocessor architectures, however, offers an opportunity to mitigate hot-spot problems through thermal-aware scheduling of workloads.
In our research, we investigated the interplay among temporal and spatial hot-spot mitigation schemes, thermal time constants, workload variations and microprocessor power distributions. Two observations motivated our work:
1. Given the advanced, fine-grain clock-gating technology prevalent in modern processors, on-chip temperature hot spots and power profiles are closely tied to unit-level utilization.
2. The rise and fall times of on-chip temperatures are typically in the hundreds of milliseconds range. This is at least an order of magnitude larger than Operating System (OS) scheduler ticks, which are in the range of milliseconds.
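The timescale separation behind observation 2 can be sketched with a simple first-order thermal model. The time constant and tick length below are illustrative values consistent with the ranges quoted above, not measurements from our system:

```python
import math

# First-order thermal model: temperature approaches its steady state with
# time constant tau. Illustrative assumptions: tau = 200 ms (hundreds of
# milliseconds, per the text) and a 4 ms scheduler tick.
TAU_MS = 200.0
TICK_MS = 4.0

def temp_after(t_ms, t_start, t_steady, tau_ms=TAU_MS):
    """Temperature after t_ms, decaying exponentially toward t_steady."""
    return t_steady + (t_start - t_steady) * math.exp(-t_ms / tau_ms)

# Fraction of the total temperature rise that occurs within one tick,
# and the number of ticks needed to reach 90% of the rise.
rise_one_tick = 1.0 - math.exp(-TICK_MS / TAU_MS)
ticks_to_90pct = math.ceil(-TAU_MS * math.log(0.1) / TICK_MS)
print(f"rise per tick: {rise_one_tick:.1%}, ticks to 90% of rise: {ticks_to_90pct}")
```

With these numbers, a core covers only about two percent of its total temperature excursion per tick, so the scheduler observes on the order of a hundred ticks before a hot spot fully develops, which is what makes OS-level reaction feasible.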
By leveraging spatial and temporal heat slacks, our schemes lower on-chip unit temperatures by adjusting the workload in a timely manner, using the OS and the existing on-chip digital thermal sensors.
We performed our experiments on a live POWER system running Bare Metal Linux (BML). To measure the changes in on-chip temperatures, we modified BML to sample the on-chip thermal sensors at every scheduler tick, a granularity of four milliseconds (ms) per sample. We calibrated the thermal sensors using the Spatially-resolved Imaging of Microprocessor Power (SIMP) methodology developed at IBM Research by Hamann et al. To make infrared imaging possible, we replaced the metal heat sink of the POWER system with a transparent liquid heat sink, as illustrated in Figure 1.
Using this experimental setup, we quantified the potential temperature savings achievable with a static workload scheduler.
Using hot-spot mitigation techniques without creating more complexity
To show that a modern OS such as Linux can implement hot-spot mitigation techniques without additional complexity, we developed a prototype thermal-aware scheduler extension to the Linux kernel (linux-2.6.17). Figure 2 shows the three schemes we prototyped in Linux: heat balancing, deferred execution, and reduced threading with cool-loop augmentation. These three techniques have different triggering conditions and response times. Heat balancing, for example, responds more slowly but has less overhead, so we configured the system to trigger it at lower temperatures. If heat balancing could not stop the rise in on-chip temperatures, the next technique -- deferred execution -- would be triggered.
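The tiered triggering described above can be sketched as a simple dispatch on core temperature. The threshold values and names below are illustrative assumptions, not the settings used in our prototype:

```python
# Tiered triggering: cheaper, slower techniques fire at lower temperatures;
# costlier, faster-acting ones fire as the temperature climbs. All threshold
# values are made up for illustration.
HEAT_BALANCE_C = 75.0
DEFER_EXEC_C = 80.0
COOL_LOOP_C = 85.0

def select_mitigation(core_temp_c):
    """Pick the mitigation scheme for the current core temperature."""
    if core_temp_c >= COOL_LOOP_C:
        return "cool_loop"
    if core_temp_c >= DEFER_EXEC_C:
        return "deferred_execution"
    if core_temp_c >= HEAT_BALANCE_C:
        return "heat_balancing"
    return "none"
```

The ordering matters: each check falls through to the next cheaper technique, so a core that is only mildly warm incurs only the low-overhead heat-balancing path.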
To implement heat-balancing, we augmented the load-balancing code in the Linux scheduler so that it takes the thermal characteristics of each task and core into account when making task-migration decisions. When the system is overloaded, the heat-balancing extension attempts to assign hot and cold tasks to each core to create opportunities for leveraging temporal heat slack. When the system has fewer tasks than the number of cores, the heat-balancing routine moves a hot task to a colder, idle core, thereby creating opportunities to leverage spatial heat slacks.
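The two heat-balancing decisions can be sketched as follows. The per-task "heat" scores, the pairing heuristic, and all names are illustrative assumptions rather than our actual kernel code:

```python
# Hypothetical sketch of the two heat-balancing cases described above.
# Tasks carry an estimated heat score; cores report a temperature.

def pair_hot_and_cold(tasks):
    """Overloaded case: pair the hottest remaining task with the coldest,
    so each core alternates hot and cold work (temporal heat slack).
    Any leftover middle task is simply left unpaired."""
    ordered = sorted(tasks, key=lambda t: t["heat"])
    pairs = []
    while len(ordered) >= 2:
        cold = ordered.pop(0)
        hot = ordered.pop(-1)
        pairs.append((hot, cold))
    return pairs

def migrate_hot_task(tasks, core_temps, idle_cores):
    """Underloaded case: move the hottest task to the coldest idle core
    (spatial heat slack). Returns (pid, target core) or None."""
    if not idle_cores:
        return None
    hottest = max(tasks, key=lambda t: t["heat"])
    coldest = min(idle_cores, key=lambda c: core_temps[c])
    return hottest["pid"], coldest
```

In a real kernel these decisions would hook into the existing load-balancing paths, as described above; the sketch only captures the selection logic.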
When heat balancing is not triggered in time to prevent a rise in temperatures, the deferred-execution scheme temporarily suspends the hot task to allow other, colder tasks to run.
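A minimal sketch of the deferred-execution policy, assuming a per-task heat estimate (the names and threshold are hypothetical):

```python
# Deferred execution: temporarily park hot tasks so cooler runnable tasks
# get the CPU. Deferring only makes sense when cooler work exists to fill
# the gap; otherwise the queue is left untouched.

def defer_hot_tasks(runqueue, heat_threshold):
    """Split the runqueue into tasks allowed to run now and deferred hot ones."""
    runnable = [t for t in runqueue if t["heat"] < heat_threshold]
    deferred = [t for t in runqueue if t["heat"] >= heat_threshold]
    if not runnable:
        return runqueue, []
    return runnable, deferred
```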
To cover cases where neither heat balancing nor deferred execution could reduce on-chip hot spots, we implemented a new kernel task called Cool Loop to create temporal heat slack. The scheduler can decrease temperatures either by reducing threading -- running cool loops on the two hardware threads in a time-interleaved manner so that only one hardware thread is active during the cooling period -- or by suspending all threading -- running cool loops on both hardware threads at the same time during the cooling period.
The cool loop performs no useful computation while it runs. Its OS priority, however, is higher than that of user tasks but lower than that of interrupt service routines and scheduler ticks. Allowing interrupt service routines to run poses no additional heating risk because interrupt routines, including scheduler ticks, are typically short. After triggering cool loops, the scheduler checks the core temperature at every tick and resumes user tasks when the core temperature drops to a preset, colder level.
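The trigger-and-resume behavior above is a classic hysteresis loop, which can be sketched as follows; the two thresholds are illustrative values, not the ones used in our prototype:

```python
# Hysteresis sketch of the cool-loop exit condition: once triggered at the
# upper threshold, the cool loop keeps running at every tick until the core
# drops to a preset, colder resume temperature. Thresholds are illustrative.
TRIGGER_C = 85.0
RESUME_C = 80.0

def cool_loop_active(prev_active, core_temp_c,
                     trigger_c=TRIGGER_C, resume_c=RESUME_C):
    """Decide at each scheduler tick whether the cool loop should run."""
    if prev_active:
        # Keep cooling until we drop to the lower resume point.
        return core_temp_c > resume_c
    # Only trigger at the higher threshold.
    return core_temp_c >= trigger_c
```

The gap between the two thresholds prevents the scheduler from oscillating between cooling and resuming user tasks on every tick.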
Current power-aware microarchitecture R&D focuses mainly on power reduction (at minimal performance cost). For reducing cooling costs and improving system reliability, however, hot-spot mitigation is more important. Moreover, the focus is increasingly shifting toward thermal mitigation solutions that involve system software: hardware-only solutions carry additional verification cost and power/performance overhead, which are becoming hard to justify.
Our current work, therefore, is a step toward modern temperature-aware system design. Lowering the average temperature by three to five degrees can reduce leakage power by about five percent in current technologies; the exact reduction depends on the base thermal operating point and the particular CMOS technology node. Such a thermal reduction also increases mean time to failure. Quantifying the reliability benefit depends on the exact chip design details, but a three-to-five-degree reduction in temperature is significant. With more aggressive on-chip power management features, the temperature reduction in future systems will be even greater.
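The leakage figure can be sanity-checked with a back-of-envelope exponential leakage model; the coefficient below is an assumption chosen to match the quoted five-percent saving, not a measured device parameter:

```python
import math

# Rough model, not a measurement: subthreshold leakage grows approximately
# exponentially with temperature, P_leak(T) = P0 * exp(k * (T - T0)).
# k is chosen here so that a 4 C drop yields about a 5% saving, consistent
# with the figure quoted above; the real k depends on the CMOS node.
K_PER_C = 0.0128

def leakage_saving(delta_t_c, k=K_PER_C):
    """Fractional leakage-power reduction for a temperature drop of delta_t_c."""
    return 1.0 - math.exp(-k * delta_t_c)

print(f"saving for a 4 C drop: {leakage_saving(4.0):.1%}")
```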
Although we performed our experiments at the OS level to quantify real savings from temperature-aware scheduling, we believe that the system hypervisor and workload manager can also implement these schemes. We have started discussions with OS and hypervisor groups within IBM Research and the Systems & Technology Group (STG), in conjunction with the system-level (hardware) power management group at the IBM Research lab in Austin, to investigate the technical feasibility of applying this technique to future IBM server systems.