Resilient Architecture Design for Voltage Variation

Shrinking feature size and diminishing supply voltage are making circuits sensitive to supply voltage fluctuations within the microprocessor, caused by normal workload activity changes. If left unattended, voltage fluctuations can lead to timing violations or even transistor lifetime issues that degrade processor robustness. Mechanisms that learn to tolerate, avoid, and eliminate voltage fluctuations based on program and microarchitectural events can help steer the processor clear of danger, thus enabling tighter voltage margins that improve performance or lower power consumption. We describe the problem of voltage variation and the factors that influence this variation during processor design and operation. We also describe a variety of runtime hardware and software mitigation techniques that either tolerate, avoid, and/or eliminate voltage violations. We hope processor architects will find the information useful since tolerance, avoidance, and elimination are generalizable constructs that can serve as a basis for addressing other reliability challenges as well. Table of Contents: Introduction / Modeling Voltage Variation / Understanding the Characteristics of Voltage Variation / Traditional Solutions and Emerging Solution Forecast / Allowing and Tolerating Voltage Emergencies / Predicting and Avoiding Voltage Emergencies / Eliminiating Recurring Voltage Emergencies / Future Directions on Resiliency

[1]  B. Beker,et al.  Modeling of power distribution systems for high-performance microprocessors , 1999 .

[2]  Jihong Kim,et al.  Power-aware modulo scheduling for high-performance VLIW processors , 2001, ISLPED '01.

[3]  Saurabh Dighe,et al.  Within-Die Variation-Aware Dynamic-Voltage-Frequency-Scaling With Optimal Core Allocation and Thread Hopping for the 80-Core TeraFLOPS Processor , 2011, IEEE Journal of Solid-State Circuits.

[4]  Margaret Martonosi,et al.  Control techniques to eliminate voltage emergencies in high performance processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[5]  Dean M. Tullsen,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[6]  Meeta Sharma Gupta,et al.  Towards a software approach to mitigate voltage emergencies , 2007, Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07).

[7]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[8]  David M. Bull,et al.  RazorII: In Situ Error Detection and Correction for PVT and SER Tolerance , 2009, IEEE Journal of Solid-State Circuits.

[9]  Vivek Tiwari,et al.  Microarchitectural simulation and control of di/dt-induced power supply voltage variation , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[10]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[11]  Kemal Aygun Power Delivery for High-Performance Microprocessors , 2005 .

[12]  T. N. Vijaykumar,et al.  Exploiting resonant behavior to reduce inductive noise , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[13]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[14]  David H. Albonesi,et al.  Mitigating inductive noise in SMT processors , 2004 .

[15]  Keith A. Bowman,et al.  Resilient circuits — Enabling energy-efficient performance and reliability , 2009, 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers.

[16]  Brooks David,et al.  Voltage Noise: Why It’s Bad, and What To Do About It , 2009 .

[17]  Current demand balancing: a technique for minimization of current surge in high performance clock-gated microprocessors , 2005, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[18]  J.F. Martinez,et al.  Cherry: Checkpointed early resource recycling in out-of-order microprocessors , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[19]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[20]  Vasanth Bala,et al.  Dynamo: a transparent dynamic optimization system , 2000, SIGP.

[21]  Meeta Sharma Gupta,et al.  Software-assisted hardware reliability: Abstracting circuit-level challenges to the software stack , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[22]  David M. Brooks,et al.  Eliminating voltage emergencies via microarchitectural voltage control feedback and dynamic optimization , 2004, Proceedings of the 2004 International Symposium on Low Power Electronics and Design (IEEE Cat. No.04TH8758).

[23]  David H. Albonesi,et al.  Front-end policies for improved issue efficiency in SMT processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[24]  Larry L. Biro,et al.  Power considerations in the design of the Alpha 21264 microprocessor , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).

[25]  Francisco J. Cazorla,et al.  Predictable performance in SMT processors: synergy between the OS and SMTs , 2006, IEEE Transactions on Computers.

[26]  Edward J. McCluskey,et al.  On-line delay testing of digital circuits , 1994, Proceedings of IEEE VLSI Test Symposium.

[27]  George A. Katopis,et al.  Modeling, simulation, and measurement of mid-frequency simultaneous switching noise in computer systems , 1998 .

[28]  Meeta Sharma Gupta,et al.  Eliminating voltage emergencies via software-guided code transformations , 2010, TACO.

[29]  David Blaauw,et al.  Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[30]  Mary Lou Soffa,et al.  Contention aware execution: online contention detection and response , 2010, CGO '10.

[31]  Yu Cao,et al.  New Generation of Predictive Technology Model for Sub-45 nm Early Design Exploration , 2006, IEEE Transactions on Electron Devices.

[32]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[33]  Meeta Sharma Gupta,et al.  Understanding Voltage Variations in Chip Multiprocessors using a Distributed Power-Delivery Network , 2007, 2007 Design, Automation & Test in Europe Conference & Exhibition.

[34]  Sanu Mathew,et al.  A 9-GHz 65-nm Intel® Pentium 4 Processor Integer Execution Unit , 2007, IEEE J. Solid State Circuits.

[35]  Shin'ichiro Mutoh,et al.  1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS , 1995, IEEE J. Solid State Circuits.

[36]  Michael Nicolaidis Time redundancy based soft-error tolerance to rescue nanometer technologies , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).

[37]  T. N. Vijaykumar,et al.  Pipeline damping: a microarchitectural technique to reduce inductive noise in supply voltage , 2003, ISCA '03.

[38]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[39]  Howard Chen,et al.  Power Management and Its Impact on Power Supply Noise , 2009, PATMOS.

[40]  Alexandra Fedorova,et al.  Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS 2010.

[41]  Shubhendu S. Mukherjee,et al.  Measuring Architectural Vulnerability Factors , 2003, IEEE Micro.

[42]  David Blaauw,et al.  A Power-Efficient 32 bit ARM Processor Using Timing-Error Detection and Correction for Transient-Error Tolerance and Adaptation to PVT Variation , 2011, IEEE Journal of Solid-State Circuits.

[43]  Katsuyuki Sakuma,et al.  Three-dimensional silicon integration , 2008, IBM J. Res. Dev..

[44]  José F. Martínez,et al.  Checkpointed early load retirement , 2005, 11th International Symposium on High-Performance Computer Architecture.

[45]  Hiroyuki Sugiyama,et al.  A 1.3 GHz fifth generation SPARC64 microprocessor , 2003 .

[46]  Tong Li,et al.  Using OS Observations to Improve Performance in Multicore Systems , 2008, IEEE Micro.

[47]  Sanjay Pant,et al.  A self-tuning DVS processor using delay-error detection and correction , 2005, IEEE Journal of Solid-State Circuits.

[48]  Haitham Akkary,et al.  Checkpoint processing and recovery: towards scalable large instruction window processors , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[49]  Todd M. Austin,et al.  Ultra low-cost defect protection for microprocessor pipelines , 2006, ASPLOS XII.

[50]  Meeta Sharma Gupta,et al.  DeCoR: A Delayed Commit and Rollback mechanism for handling inductive noise in processors , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[51]  Meeta Sharma Gupta,et al.  An event-guided approach to reducing voltage noise in processors , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[52]  K.A. Bowman,et al.  Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance , 2009, IEEE Journal of Solid-State Circuits.

[53]  Meeta Sharma Gupta,et al.  Predicting Voltage Droops Using Recurring Program and Microarchitectural Event Activity , 2010, IEEE Micro.

[54]  S. Tam,et al.  A Dual-Core Multi-Threaded Xeon Processor with 16MB L3 Cache , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[55]  Gu-Yeon Wei,et al.  Measuring Code Optimization Impact on Voltage Noise , 2013 .

[56]  Meeta Sharma Gupta,et al.  Voltage emergency prediction: Using signatures to reduce operating margins , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[57]  Luiz André Barroso,et al.  Web Search for a Planet: The Google Cluster Architecture , 2003, IEEE Micro.

[58]  Edward J. McCluskey,et al.  DELAY TESTING OF DIGITAL CIRCUITS BY OUTPUT WAVEFORM ANALYSIS , 1991, 1991, Proceedings. International Test Conference.

[59]  Paolo A. Aseron,et al.  A 45 nm Resilient Microprocessor Core for Dynamic Variation Tolerance , 2011, IEEE Journal of Solid-State Circuits.

[60]  William V. Huott,et al.  Comparison of Split-Versus Connected-Core Supplies in the POWER6 Microprocessor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[61]  David Kaplan,et al.  The core-C6 (CC6) sleep state of the AMD bobcat x86 microprocessor , 2012, ISLPED '12.

[62]  Milo M. K. Martin,et al.  Fast Checkpoint/Recovery to Support Kilo-Instruction Speculation and Hardware Fault Tolerance , 2000 .

[63]  Hsien-Hsin S. Lee,et al.  Noise-Direct: A Technique for Power Supply Noise Aware Floorplanning Using Microarchitecture Profiling , 2007, 2007 Asia and South Pacific Design Automation Conference.

[64]  Ed Grochowski,et al.  Implications of Device Timing Variability on Full Chip Timing , 2007, HPCA.

[65]  Lisa Spainhower,et al.  G4: a fault-tolerant CMOS mainframe , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[66]  Vivek Tiwari,et al.  An architectural solution for the inductive noise problem due to clock-gating , 1999, Proceedings. 1999 International Symposium on Low Power Electronics and Design (Cat. No.99TH8477).

[67]  D. Scott Wills,et al.  On-chip decoupling capacitor optimization using architectural level prediction , 2000, IEEE Trans. Very Large Scale Integr. Syst..

[68]  Xiang Pan,et al.  VRSync: Characterizing and eliminating synchronization-induced voltage emergencies in many-core processors , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[69]  Elad Alon,et al.  Integrated Regulation for Energy-Efficient Digital Circuits , 2008, IEEE J. Solid State Circuits.

[70]  Mark C. Toburen Power analysis and instruction scheduling for reduced di/dt in the execution core of high-performanc , 1999 .

[71]  Bishop Brock,et al.  Active management of timing guardband to save energy in POWER7 , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[72]  Michael D. Smith,et al.  Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[73]  Dean M. Tullsen,et al.  Handling long-latency loads in a simultaneous multithreading processor , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[74]  Lizy Kurian John,et al.  AUDIT: Stress Testing the Automatic Way , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[75]  T. N. Vijaykumar,et al.  Pipeline muffling and a priori current ramping: architectural techniques to reduce high-frequency inductive noise , 2003, ISLPED '03.

[76]  Bruce G. Mealey,et al.  IBM POWER6 reliability , 2007, IBM J. Res. Dev..

[77]  Luiz André Barroso,et al.  The Price of Performance , 2005, ACM Queue.

[78]  E. You,et al.  A third-generation SPARC V9 64-b microprocessor , 2000, IEEE Journal of Solid-State Circuits.

[79]  Hiroyuki Sugiyama,et al.  A 1.3 GHz fifth generation SPARC64 microprocessor , 2003, 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC..

[80]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[81]  Vivek De,et al.  Intrinsic MOSFET parameter fluctuations due to random dopant placement , 1997, IEEE Trans. Very Large Scale Integr. Syst..

[82]  Satish Narayanasamy,et al.  BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging , 2005, ISCA 2005.

[83]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[84]  Soraya Ghiasi,et al.  A Distributed Critical-Path Timing Monitor for a 65nm High-Performance Microprocessor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[85]  Jerome H. Saltzer,et al.  End-to-end arguments in system design , 1984, TOCS.

[86]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).