Underdesigned and Opportunistic Computing in Presence of Hardware Variability

Microelectronic circuits exhibit increasing variations in performance, power consumption, and reliability parameters across the manufactured parts and across use of these parts over time in the field. These variations have led to increasing use of overdesign and guardbands in design and test to ensure yield and reliability with respect to a rigid set of datasheet specifications. This paper explores the possibility of constructing computing machines that purposely expose hardware variations to various layers of the system stack including software. This leads to the vision of underdesigned hardware that utilizes a software stack that opportunistically adapts to a sensed or modeled hardware. The envisioned underdesigned and opportunistic computing (UnO) machines face a number of challenges related to the sensing infrastructure and software interfaces that can effectively utilize the sensory data. In this paper, we outline specific sensing mechanisms that we have developed and their potential use in building UnO machines.

[1]  Sriram Sankar,et al.  Impact of temperature on hard disk drive reliability in large datacenters , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[2]  Josep Torrellas,et al.  ReCycle:: pipeline adaptation to tolerate process variation , 2007, ISCA '07.

[3]  Mark S. K. Lau,et al.  Energy-aware probabilistic multiplier: design and analysis , 2009, CASES '09.

[4]  Subhasish Mitra,et al.  CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns , 2008, 2008 Design, Automation and Test in Europe.

[5]  Yu Cao,et al.  Circuit aging prediction for low-power operation , 2009, 2009 IEEE Custom Integrated Circuits Conference.

[6]  Meeta Sharma Gupta,et al.  Software-assisted hardware reliability: Abstracting circuit-level challenges to the software stack , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[7]  Tajana Simunic,et al.  Evaluating the impact of job scheduling and power management on processor lifetime for chip multiprocessors , 2009, SIGMETRICS '09.

[8]  Subhasish Mitra,et al.  Overcoming Early-Life Failure and Aging Challenges for Robust System Design , 2013 .

[9]  Onur Mutlu,et al.  Concurrent autonomous self-test for uncore components in system-on-chips , 2010, 2010 28th VLSI Test Symposium (VTS).

[10]  Michael Nicolaidis Time redundancy based soft-error tolerance to rescue nanometer technologies , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).

[11]  R.W. Johnson,et al.  The changing automotive environment: high-temperature electronics , 2004, IEEE Transactions on Electronics Packaging Manufacturing.

[12]  Mark D. Corner,et al.  Eon: a language and runtime system for perpetual systems , 2007, SenSys '07.

[13]  Stephen P. Boyd,et al.  Self-Tuning for Maximized Lifetime Energy-Efficiency in the Presence of Circuit Aging , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[14]  David Lin,et al.  QED: Quick Error Detection tests for effective post-silicon validation , 2010, 2010 IEEE International Test Conference.

[15]  A.B. Kahng,et al.  Impact of Guardband Reduction On Design Outcomes: A Quantitative Approach , 2009, IEEE Transactions on Semiconductor Manufacturing.

[16]  Melvin A. Breuer,et al.  Defect and error tolerance in the presence of massive numbers of defects , 2004, IEEE Design & Test of Computers.

[17]  Mihaela van der Schaar,et al.  Software adaptation in quality sensitive applications to deal with hardware variability , 2010, GLSVLSI '10.

[18]  C.H. Kim,et al.  An on-die CMOS leakage current sensor for measuring process variation in sub-90nm generations , 2004, 2005 International Conference on Integrated Circuit Design and Technology, 2005. ICICDT 2005..

[19]  John Sartori,et al.  Designing a processor from the ground up to allow voltage/reliability tradeoffs , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[20]  Farzan Fallah,et al.  Quick detection of difficult bugs for effective post-silicon validation , 2012, DAC Design Automation Conference 2012.

[21]  Douglas L. Jones,et al.  Stochastic computation , 2010, Design Automation Conference.

[22]  Woongki Baek,et al.  Green: a framework for supporting energy-conscious programming using controlled approximation , 2010, PLDI '10.

[23]  Naresh R. Shanbhag,et al.  Energy-efficient signal processing via algorithmic noise-tolerance , 1999, Proceedings. 1999 International Symposium on Low Power Electronics and Design (Cat. No.99TH8477).

[24]  Alan Edelman,et al.  PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[25]  Sandeep K. Gupta,et al.  Approximate logic synthesis for error tolerant applications , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[26]  Rajesh Gupta,et al.  Evaluating the effectiveness of model-based power characterization , 2011 .

[27]  Masahiko Yoshimoto,et al.  A Power-Variation Model for Sensor Node and the Impact against Life Time of Wireless Sensor Networks , 2006, 2006 First International Conference on Communications and Electronics.

[28]  David Blaauw,et al.  Making typical silicon matter with Razor , 2004, Computer.

[29]  William Lindsay,et al.  FRITS - a microprocessor functional BIST method , 2002, Proceedings. International Test Conference.

[30]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[31]  Rob A. Rutenbar,et al.  Virtual probe: A statistically optimal framework for minimum-cost silicon characterization of nanoscale integrated circuits , 2009, 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers.

[32]  Josep Torrellas,et al.  Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors , 2008, 2008 International Symposium on Computer Architecture.

[33]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[34]  Hiroaki Inoue,et al.  VAST: Virtualization-Assisted Concurrent Autonomous Self-Test , 2008, 2008 IEEE International Test Conference.

[35]  David A. Padua,et al.  Optimizing Sorting with Machine Learning Algorithms , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[36]  Amit Patra,et al.  On-Line Testing of Digital Circuits for n-Detect and Bridging Fault Models , 2005, 14th Asian Test Symposium (ATS'05).

[37]  Pedro José Marrón,et al.  Meeting lifetime goals with energy levels , 2007, SenSys '07.

[38]  Lara Dolecek,et al.  Loop flattening & spherical sampling: Highly efficient model reduction techniques for SRAM yield analysis , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[39]  Kartik Mohanram,et al.  Approximate logic circuits for low overhead, non-intrusive concurrent error detection , 2008, 2008 Design, Automation and Test in Europe.

[40]  John Sartori,et al.  Branch and Data Herding: Reducing Control and Memory Divergence for Error-Tolerant GPU Applications , 2012, IEEE Transactions on Multimedia.

[41]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[42]  K.A. Bowman,et al.  Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance , 2009, IEEE Journal of Solid-State Circuits.

[43]  Sujit Dey,et al.  VESPA: Variability emulation for System-on-Chip performance analysis , 2011, 2011 Design, Automation & Test in Europe.

[44]  Sandeep K. Gupta,et al.  A Re-design Technique for Datapath Modules in Error Tolerant Applications , 2008, 2008 17th Asian Test Symposium.

[45]  Sarita V. Adve,et al.  Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[46]  Jian Shen,et al.  Native mode functional test generation for processors with applications to self test and design validation , 1998, Proceedings International Test Conference 1998 (IEEE Cat. No.98CH36270).

[47]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[48]  B.C. Paul,et al.  Process variation in embedded memories: failure analysis and variation aware architecture , 2005, IEEE Journal of Solid-State Circuits.

[49]  Krishna V. Palem,et al.  Energy aware computing through probabilistic switching: a study of limits , 2005, IEEE Transactions on Computers.

[50]  Marco Platzner,et al.  Design and architectures for dependable embedded systems , 2011, 2011 Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[51]  Sujit Dey,et al.  Variation-Tolerant Dynamic Power Management at the System-Level , 2009, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[52]  Lara Dolecek,et al.  Probabilistic analysis of Gallager B faulty decoder , 2012, 2012 IEEE International Conference on Communications (ICC).

[53]  Puneet Gupta,et al.  ViPZonE: OS-level memory variability-driven physical address zoning for energy savings , 2012, CODES+ISSS '12.

[54]  David Blaauw,et al.  Dynamic NBTI Management Using a 45 nm Multi-Degradation Sensor , 2011, IEEE Trans. Circuits Syst. I Regul. Pap..

[55]  Alexandru Nicolau,et al.  A Simple Mechanism for Improving the Accuracy and Efficiency of Instruction-Level Disambiguation , 1995, LCPC.

[56]  Karthick Rajamani,et al.  Benchmarking for Power and Performance , 2007 .

[57]  Kaushik Roy,et al.  Scalable effort hardware design: Exploiting algorithmic resilience for energy efficiency , 2010, Design Automation Conference.

[58]  Kaushik Roy,et al.  CRISTA: A New Paradigm for Low-Power, Variation-Tolerant, and Adaptive Circuit Synthesis Using Critical Path Isolation , 2007, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[59]  Douglas L. Jones,et al.  Scalable stochastic processors , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[60]  David Blaauw,et al.  Dynamic NBTI management using a 45nm multi-degradation sensor , 2010, IEEE Custom Integrated Circuits Conference 2010.

[61]  Alan J. Weger,et al.  Thermal-aware task scheduling at the system software level , 2007, Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07).

[62]  Shahin Nazarian,et al.  Thermal Modeling, Analysis, and Management in VLSI Circuits: Principles and Methods , 2006, Proceedings of the IEEE.

[63]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[64]  Subhasish Mitra,et al.  ERSA: Error Resilient System Architecture for probabilistic applications , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[65]  David Blaauw,et al.  Early detection of oxide breakdown through in situ degradation sensing , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[66]  Rakesh Kumar,et al.  A numerical optimization-based methodology for application robustification: Transforming applications for error tolerance , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[67]  Puneet Gupta,et al.  VaMV: Variability-aware Memory Virtualization , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[68]  John Sartori,et al.  Recovery-driven design: A power minimization methodology for error-tolerant processor modules , 2010, Design Automation Conference.

[69]  Puneet Gupta,et al.  Trading Accuracy for Power in a Multiplier Architecture , 2011, J. Low Power Electron..

[70]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[71]  Puneet Gupta,et al.  DDRO: A novel performance monitoring methodology based on design-dependent ring oscillators , 2012, Thirteenth International Symposium on Quality Electronic Design (ISQED).

[72]  Puneet Gupta,et al.  Power Variability in Contemporary DRAMs , 2012, IEEE Embedded Systems Letters.

[73]  Mihaela van der Schaar,et al.  AppAdapt: Opportunistic Application Adaptation in Presence of Hardware Variation , 2012, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[74]  Siddharth Garg,et al.  On the impact of manufacturing process variations on the lifetime of sensor networks , 2007, 2007 5th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[75]  David Blaauw,et al.  ElastIC: An Adaptive Self-Healing Architecture for Unpredictable Silicon , 2006, IEEE Design & Test of Computers.

[76]  Ming Zhang,et al.  Circuit Failure Prediction and Its Application to Transistor Aging , 2007, 25th IEEE VLSI Test Symposium (VTS'07).

[77]  Henry Hoffmann,et al.  Dynamic knobs for responsive power-aware computing , 2011, ASPLOS XVI.

[78]  Puneet Gupta,et al.  Hardware Variability-Aware Duty Cycling for Embedded Sensors , 2013, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[79]  Braden J. Phillips,et al.  Arithmetic Data Value Speculation , 2005, Asia-Pacific Computer Systems Architecture Conference.

[80]  Ke Meng,et al.  Process Variation Aware Cache Leakage Management , 2006, ISLPED'06 Proceedings of the 2006 International Symposium on Low Power Electronics and Design.

[81]  Chen-Yong Cher,et al.  Temperature Variation Characterization and Thermal Management of Multicore Architectures , 2009, IEEE Micro.

[82]  Saurabh Dighe,et al.  Within-die variation-aware dynamic-voltage-frequency scaling core mapping and thread hopping for an 80-core processor , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[83]  Edward J. McCluskey,et al.  On-line delay testing of digital circuits , 1994, Proceedings of IEEE VLSI Test Symposium.

[84]  Krishna V. Palem,et al.  Probabilistic arithmetic and energy efficient embedded signal processing , 2006, CASES '06.

[85]  Paul H. Siegel,et al.  Characterizing flash memory: Anomalies, observations, and applications , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[86]  Sarita V. Adve,et al.  Trace-based microarchitecture-level diagnosis of permanent hardware faults , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[87]  John Sartori,et al.  Slack redistribution for graceful degradation under voltage overscaling , 2010, 2010 15th Asia and South Pacific Design Automation Conference (ASP-DAC).

[88]  Edward J. McCluskey,et al.  ED4I: Error Detection by Diverse Data and Duplicated Instructions , 2002, IEEE Trans. Computers.