This project focuses on microarchitectural support for reliable operation in the face of hard and soft hardware failures. The technology trends that cause high on-chip temperatures and increased soft-error rates have motivated architects to pursue new approaches that maintain target reliability figures without exceeding power budgets or incurring significant performance degradation. In this project, we focus on the following problems:
- How can we model the effects of hard and soft failures (which first manifest at the device or interconnect level) at the architecture or system level? Can we project the chip-level failure rate (in FITs) or the mean time to failure (MTTF) for various input workloads of interest?
- How can we validate and calibrate pre-silicon predictive models that project the reliability metrics of a target chip or system? What are the limits of applicability of our assumed modeling axioms?
- What indeed are the right metrics to use?
- How can we best apply the principles of spatial and temporal redundancy in architecting solutions that provide error tolerance while maintaining performance targets and power budgets?
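
To make the FIT and MTTF metrics named in the first problem concrete, the following minimal sketch shows how the two relate. It is not the project's model. It assumes constant (exponential) failure rates and a series system in which any component failure fails the chip, so that per-component FIT rates simply add. One FIT is one failure per 10^9 device-hours. The function names and the component FIT values are hypothetical, chosen only for illustration.

```python
HOURS_PER_FIT_UNIT = 1e9  # one FIT = one failure per 10^9 device-hours


def chip_fit(component_fits):
    """Chip-level FIT as the sum of component FITs.

    Assumption: series system with independent, constant failure rates.
    """
    return sum(component_fits.values())


def mttf_hours(fit):
    """MTTF in hours for a constant failure rate given in FITs."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_FIT_UNIT / fit


# Hypothetical per-component FIT values, for illustration only.
example_fits = {"core": 400.0, "cache": 500.0, "interconnect": 100.0}

total_fit = chip_fit(example_fits)
print(f"chip FIT: {total_fit:.0f}")
print(f"MTTF: {mttf_hours(total_fit):.0f} hours")
```

Under these assumptions, the per-component FITs would in practice be workload-dependent outputs of the lower-level models, which is exactly where the validation and calibration questions above come in.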