CEDAR: Modeling impact of component error derating and read frequency on system-level vulnerability in high-performance processors

Abstract: Radiation-induced soft errors seriously challenge the reliability of current microprocessor technology. Accurate Vulnerability Factor (VF) modeling of system components is crucial for designing cost-effective protection schemes in high-performance processors. Statistical Fault Injection (SFI) techniques provide relatively accurate VF estimates, but they are often very time-consuming. Recently proposed analytical models compute VF far more quickly, but the resulting VFs are inaccurate because these models overlook the system-level impact of soft errors. In this paper, we propose a system-level analytical technique, the Component Error Derating And Read frequency (CEDAR) vulnerability model, which combines the advantages of previously presented analytical models and SFI techniques. The key idea behind CEDAR is to take into account component error derating and read frequency for data-path blocks in high-performance processors. To further investigate the impact of read frequency and component error derating on the system-level VF, we use the Input-to-Output Derating (IOD) factor of system components in the proposed analytical model. As a case study, we examine the system-level vulnerability of cache memory by providing IOD analysis for different processor core configurations. Our experimental results reveal that processor core IOD can significantly affect the system-level vulnerability of cache memories. They also show that CEDAR improves the accuracy of previous analytical VF estimation techniques by up to 91% and 5% for write-through and write-back cache memories, respectively, while speeding up estimation by up to 10× compared to SFI techniques.
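To make the roles of read frequency and derating concrete, the following is a minimal illustrative sketch and not the CEDAR equations themselves, which the abstract does not give. It assumes a single cache word with a trace of read and write accesses, assumes that an error is overwritten by the next write, and assumes that each read of a corrupted word independently propagates to the system output with probability equal to the core's IOD. The names `Access` and `toy_system_level_vf` are hypothetical, and write-through versus write-back behavior is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Access:
    time: int   # cycle at which the access occurs
    kind: str   # "read" or "write"


def toy_system_level_vf(trace: List[Access], total_cycles: int, iod: float) -> float:
    """Toy VF estimate for one cache word (illustrative assumption only).

    An error striking at some cycle is seen by the k reads that follow
    before the next write. Assuming each read is masked independently
    with probability (1 - iod), the error reaches the system output
    with probability 1 - (1 - iod) ** k.
    """
    weighted_cycles = 0.0
    reads_ahead = 0          # reads between the current point and the next write
    next_time = total_cycles
    # Walk the trace backwards so reads_ahead is known for each interval.
    for access in sorted(trace, key=lambda a: a.time, reverse=True):
        interval = next_time - access.time
        weighted_cycles += interval * (1.0 - (1.0 - iod) ** reads_ahead)
        if access.kind == "read":
            reads_ahead += 1
        else:                # a write overwrites any earlier error
            reads_ahead = 0
        next_time = access.time
    # Cycles before the first access (zero weight if the trace starts with a write).
    weighted_cycles += next_time * (1.0 - (1.0 - iod) ** reads_ahead)
    return weighted_cycles / total_cycles


trace = [Access(0, "write"), Access(10, "read"),
         Access(20, "read"), Access(30, "write")]
vf_no_derating = toy_system_level_vf(trace, 40, iod=1.0)
vf_derated = toy_system_level_vf(trace, 40, iod=0.5)
print(vf_no_derating, vf_derated)
```

With `iod=1.0` the sketch counts every cycle that is followed by a read as fully vulnerable, which is the behavior of a model that ignores derating. Lowering `iod` reduces the estimate, and intervals followed by more reads are reduced less, which is the qualitative interaction between read frequency and component error derating that the abstract describes.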
