Performance-aware reliability assessment of heterogeneous chips

Technology evolution has raised serious reliability considerations, as transistor dimensions shrink and modern microprocessors become denser and more vulnerable to faults. Reliability studies have proposed a plethora of methodologies for assessing system vulnerability which, however, highly rely on traditional reliability metrics that solely express failure rate over time. Although Failures In Time (FIT) is a very strong and representative reliability metric, it may fail to offer an objective comparison of highly diverse systems, such as CPUs against GPUs or other accelerators that are often employed to execute the same algorithms implemented for these platforms.

[1]  Sudhakar Yalamanchili,et al.  Cooperative boosting: needy versus greedy power management , 2013, ISCA.

[2]  Michail Maniatakos,et al.  Revisiting Vulnerability Analysis in Modern Microprocessors , 2015, IEEE Transactions on Computers.

[3]  Xiaodong Li,et al.  Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[4]  Stijn Eyerman,et al.  A first-order mechanistic model for architectural vulnerability factor , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[5]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[6]  Bo Fang,et al.  GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[7]  Mehdi Baradaran Tahoori,et al.  Fault injection acceleration by architectural importance sampling , 2015, 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[8]  David A. Wood,et al.  A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).

[9]  Dimitris Gizopoulos,et al.  Accelerated microarchitectural Fault Injection-based reliability assessment , 2015, 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS).

[10]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.

[11]  Sanjay J. Patel,et al.  Examining ACE analysis reliability estimates using fault-injection , 2007, ISCA '07.

[12]  Gabriel H. Loh,et al.  Architectural Vulnerability Modeling and Analysis of Integrated Graphics Processors , 2013 .

[13]  John Lach,et al.  Transient fault models and AVF estimation revisited , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[14]  Joel Emer,et al.  SASSIFI : Evaluating Resilience of GPU Applications , 2015 .

[15]  Ronald G. Dreslinski,et al.  Sources of error in full-system simulation , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[16]  Arijit Biswas,et al.  Computing Accurate AVFs using ACE Analysis on Performance Models: A Rebuttal , 2008, IEEE Computer Architecture Letters.

[17]  Jie Liu,et al.  Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[18]  Minsu Choi,et al.  Workload-dependent relative fault sensitivity and error contribution factor of GPU onchip memory structures , 2013, 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS).

[19]  Arijit Biswas,et al.  Computing architectural vulnerability factors for address-based structures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[20]  Jacob A. Abraham,et al.  Quantitative evaluation of soft error injection techniques for robust system design , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[21]  Mateo Valero,et al.  FIMSIM: A fault injection infrastructure for microarchitectural simulators , 2011, 2011 IEEE 29th International Conference on Computer Design (ICCD).

[22]  Hyeran Jeon,et al.  Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[23]  James L. Walsh,et al.  IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..

[24]  David I. August,et al.  Design and Evaluation of Hybrid Fault-Detection Systems , 2005, ISCA 2005.

[25]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[26]  Josep Torrellas,et al.  Comparing the power and performance of Intel's SCC to state-of-the-art CPUs and GPUs , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[27]  David I. August,et al.  Design and evaluation of hybrid fault-detection systems , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[28]  Mehdi Baradaran Tahoori,et al.  Balancing Performance and Reliability in the Memory Hierarchy , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[29]  Dimitris Gizopoulos,et al.  Anatomy of microarchitecture-level reliability assessment: Throughput and accuracy , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[30]  Dimitris Gizopoulos,et al.  GUFI: A framework for GPUs reliability assessment , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[31]  Dimitris Gizopoulos,et al.  Differential Fault Injection on Microarchitectural Simulators , 2015, 2015 IEEE International Symposium on Workload Characterization.

[32]  Yu Cao,et al.  A resilience roadmap , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[33]  Sarita V. Adve,et al.  Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults , 2012, ASPLOS XVII.

[34]  Michail Maniatakos,et al.  Instruction-Level Impact Analysis of Low-Level Faults in a Modern Microprocessor Controller , 2011, IEEE Transactions on Computers.

[35]  M.D. Beaudry,et al.  PERFORMANCE RELATED RELIABILITY MEASURES FOR COMPUTING SYSTEMS , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ' Highlights from Twenty-Five Years'..

[36]  Régis Leveugle,et al.  Statistical fault injection: Quantified error and confidence , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.