Process variation and temperature-aware reliability management

In aggressively scaled technologies, reliability concerns such as oxide breakdown have become a key issue. Dynamic reliability management (DRM) has been proposed as a mechanism for dynamically exploring the trade-off between system performance and reliability margin. However, existing DRM methods do not accurately model spatial and temporal variations in process and temperature parameters, both of which strongly affect chip reliability. They also make the simplifying assumption that future workloads will be identical to the currently observed one, which makes them sensitive to sudden workload variations and outliers. In this paper, we present a novel workload-aware dynamic reliability management framework that accounts for local variations in both process and temperature. The reliability estimate, along with the predicted remaining workload, is fed to a dynamic voltage/frequency scaling module that manages system reliability and optimizes processor performance. Using a fast on-line analytical/table-look-up method, we demonstrate an average error of 1% with up to 5 orders of magnitude speedup compared to Monte Carlo simulation. Experiments on an Alpha-like processor show that our DRM framework fully utilizes the available margin and achieves a 28.7% performance improvement on average.
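
The control idea described above (a reliability estimate plus a workload prediction feeding a voltage/frequency decision) can be illustrated with a minimal sketch. This is not the paper's method: the operating points in `LEVELS`, the exponential voltage and Arrhenius temperature acceleration in `damage_rate`, and the parameters `gamma`, `ea`, `v_ref`, and `t_ref` are all hypothetical values chosen for illustration, and process variation is ignored entirely.

```python
import math

# Hypothetical operating points: (supply voltage in V, frequency in GHz).
LEVELS = [(0.9, 1.0), (1.0, 1.5), (1.1, 2.0), (1.2, 2.4)]

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def damage_rate(voltage, temp_k, gamma=12.0, ea=0.6, v_ref=1.0, t_ref=350.0):
    """Wear-out rate relative to a reference condition (assumed model).

    Uses an exponential voltage acceleration factor and an Arrhenius
    temperature factor; all parameter values are illustrative assumptions.
    """
    volt_acc = math.exp(gamma * (voltage - v_ref))
    temp_acc = math.exp((ea / K_BOLTZMANN_EV) * (1.0 / t_ref - 1.0 / temp_k))
    return volt_acc * temp_acc


def pick_level(remaining_budget, remaining_time, predicted_temp_k):
    """Pick the fastest level whose projected damage fits the budget.

    remaining_budget is expressed in reference-condition time units, so
    running at (v_ref, t_ref) consumes one unit of budget per unit of time.
    Falls back to the lowest level if nothing fits.
    """
    best = LEVELS[0]
    for voltage, freq in LEVELS:
        projected = damage_rate(voltage, predicted_temp_k) * remaining_time
        if projected <= remaining_budget and freq >= best[1]:
            best = (voltage, freq)
    return best
```

For instance, with a budget equal to the remaining time at the reference temperature, `pick_level(10.0, 10.0, 350.0)` selects the reference point `(1.0, 1.5)`, while a larger unused margin allows a faster level. A real controller would replace the fixed temperature input with a predicted workload trace and a variation-aware reliability estimate.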
