Reliability and Power-Aware Microarchitectures - overview
Power-efficient architectures:
We explore research problems in power- and complexity-aware microprocessor design, analysis, and validation. The main objectives are (1) to devise and evaluate novel microarchitectures and (2) to develop a robust modeling and validation infrastructure to support this research. To achieve the first goal, designs must be complexity-aware. Power consumption and verification cost are two major modern design constraints, and both are indices of complexity; verification cost also determines time-to-market. As part of the first goal, we are learning how to build tomorrow's processors so that system performance scales while power and verification complexity grow at a manageable rate. To achieve the second goal, we integrate energy models into traditional cycle-accurate performance simulators. These energy models are built using circuit-level tools as well as analytical models.
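As an illustration of what integrating an energy model into a performance simulator can look like, here is a minimal sketch of activity-based energy accounting. It is not the project's infrastructure: the class and function names, the unit names, and all numeric values are hypothetical. It assumes the simulator reports per-unit access counts and a cycle count for an interval, and that each unit's per-access energy and leakage power have been characterized separately, for example with circuit-level tools or analytical models.

```python
from dataclasses import dataclass


@dataclass
class UnitEnergyModel:
    """Hypothetical energy parameters for one microarchitectural unit."""
    dynamic_nj_per_access: float  # dynamic energy per access, in nanojoules
    leakage_mw: float             # static (leakage) power, in milliwatts


def estimate_energy_nj(models, access_counts, cycles, clock_ghz):
    """Estimate total energy (nJ) for one simulated interval.

    models:        dict mapping unit name -> UnitEnergyModel
    access_counts: dict mapping unit name -> accesses counted by the simulator
    cycles:        number of simulated cycles in the interval
    clock_ghz:     assumed clock frequency in GHz (cycles per nanosecond)
    """
    elapsed_ns = cycles / clock_ghz
    total_nj = 0.0
    for unit, model in models.items():
        # Dynamic energy scales with the activity the simulator observed.
        dynamic_nj = model.dynamic_nj_per_access * access_counts.get(unit, 0)
        # Leakage accrues with time: 1 mW * 1 ns = 1e-12 J = 1e-3 nJ.
        leakage_nj = model.leakage_mw * elapsed_ns * 1e-3
        total_nj += dynamic_nj + leakage_nj
    return total_nj


# Placeholder parameters for illustration only; these are not measured values.
models = {
    "alu": UnitEnergyModel(dynamic_nj_per_access=0.5, leakage_mw=10.0),
    "l1d": UnitEnergyModel(dynamic_nj_per_access=1.0, leakage_mw=20.0),
}
counts = {"alu": 100, "l1d": 50}
total = estimate_energy_nj(models, counts, cycles=1000, clock_ghz=2.0)
```

Keeping the energy parameters separate from the activity counts means the same simulation trace can be re-evaluated under different circuit assumptions without rerunning the performance simulation.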
Reliability-aware architectures:
This project focuses on microarchitectural support for reliable operation in the face of hard and soft hardware failures. Technology trends that cause high on-chip temperatures and increased soft-error rates have motivated architects to pursue new approaches that maintain target reliability figures without exceeding power budgets or incurring significant performance degradation. In this project, we focus on the following problems:
- How can we model the effects of hard and soft failures, which first manifest at the device or interconnect level, at the architecture or system level? Can we project the chip-level failure rate (in FITs) or the mean time to failure (MTTF) for various input workloads of interest?
- How can we validate and calibrate pre-silicon predictive models that project the reliability metrics of a target chip or system? What are the limits of applicability of our assumed modeling axioms?
- What indeed are the right metrics to use?
- How can we best apply the principles of spatial and temporal redundancy in architecting solutions that provide error tolerance while maintaining performance targets and power budgets?
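To make the FIT and MTTF projection in the first question concrete, here is a minimal sketch of a sum-of-failure-rates calculation. It is an illustration under stated assumptions, not the project's model. It assumes components fail independently with constant failure rates, and that the chip fails when any one component fails. The component names, FIT values, and per-workload derating factors are all hypothetical. One FIT is one failure per 10^9 device-hours, so under a constant failure rate MTTF in hours is 10^9 divided by the FIT rate.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 1e9 device-hours


def chip_fit(component_fits, derating=None):
    """Project a chip-level FIT rate from per-component FIT rates.

    component_fits: dict mapping component name -> raw FIT rate
    derating:       optional dict mapping component name -> fraction in [0, 1]
                    of raw failures that affect the workload (hypothetical,
                    workload-dependent factor; defaults to 1.0 per component)
    """
    derating = derating or {}
    # Series-system assumption: independent constant failure rates add up.
    return sum(fit * derating.get(name, 1.0)
               for name, fit in component_fits.items())


def mttf_hours(fit):
    """Convert a FIT rate to MTTF in hours, assuming a constant failure rate."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_FIT_UNIT / fit


# Placeholder values for illustration only; these are not measured data.
raw_fits = {"core": 100.0, "cache": 300.0, "interconnect": 100.0}
workload_derating = {"cache": 0.5}  # hypothetical: half of cache faults matter

raw_total = chip_fit(raw_fits)                        # all faults counted
derated_total = chip_fit(raw_fits, workload_derating)  # workload-specific
raw_mttf = mttf_hours(raw_total)
```

The constant-rate and independence assumptions are the kind of modeling axioms whose limits the second question asks about. Wear-out failures, for instance, generally do not follow a constant rate, so this sketch should be read as a first-order approximation only.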
This research project is built upon strong collaborative ties with external university groups as well as internal development group partners within the IBM Systems and Technology Group.