Hardware and software co-design for robust and resilient execution

How do we design error-tolerant processors (and associated systems) that meet historically established high reliability standards without exceeding fixed power budgets and cost constraints? This is the fundamental research challenge facing present-day and future systems architects. In the late CMOS era, device-scaling trends have increased awareness of the various sources of unreliability at the component level. Designing and building robust processors is becoming more difficult as devices grow more susceptible to transient and hard errors. Some solutions, including those used today to circumvent the power problem, have in fact been shown to worsen conditions for the emerging device “reliability wall.” Future systems will require designers across all layers of the system stack to integrate adaptive design techniques, in both hardware and software, to ensure robust and resilient execution.