A Survey of Techniques for Modeling and Improving Reliability of Computing Systems

Recent trends of aggressive technology scaling have greatly exacerbated the occurrence and impact of faults in computing systems. This has made 'reliability' a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. This paper surveys architectural techniques for improving the resilience of computing systems. We focus especially on techniques proposed for microarchitectural components such as processor registers, functional units, caches and main memory. In addition, we discuss techniques proposed for non-volatile memory, GPUs and 3D-stacked processors. To underscore the similarities and differences among the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify the vulnerability of processor structures. We believe that this survey will help researchers, system architects and processor designers gain insight into techniques for improving the reliability of computing systems.
