Experimental and Analytical Study of Xeon Phi Reliability

We present an in-depth analysis of the effects of transient faults on HPC applications running on Intel Xeon Phi processors, based on radiation experiments and high-level fault injection. Besides measuring realistic error rates for the Xeon Phi, we quantify Silent Data Corruptions (SDCs) by correlating the distribution of corrupted elements in the output with the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rates. For example, for HotSpot, accepting a 0.5% tolerance in the output values reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of the given applications, and we show that portions of an application can be graded by criticality. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than faults in the remaining portions. Mitigation techniques can then be relaxed or hardened according to the criticality of each portion.
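The interplay between fault injection and an output tolerance can be illustrated with a minimal sketch. It assumes a single bit-flip fault model (one common choice, not necessarily any of the models used in this study) and a relative tolerance check against a fault-free "golden" output. The helper names `flip_bit` and `count_mismatches` are hypothetical and introduced only for illustration; the 0.5% figure is taken from the HotSpot example above.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float64's IEEE 754 encoding.

    Assumption: a single bit-flip stands in for a transient fault.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def count_mismatches(golden, observed, rel_tol=0.0):
    """Count output elements that differ from the golden copy.

    An element is counted only if its relative error exceeds rel_tol.
    With rel_tol=0.0 every difference counts. NaN and infinite results
    are always counted.
    """
    count = 0
    for g, o in zip(golden, observed):
        if g == o:
            continue
        err = abs(o - g)
        # "not (err <= ...)" is deliberately used so that a NaN error counts.
        if not (err <= rel_tol * abs(g)):
            count += 1
    return count


# Hypothetical output vector, not data from the experiments.
golden = [100.0, 200.0, 300.0]
observed = [
    golden[0],                 # untouched
    flip_bit(golden[1], 0),    # low mantissa bit: negligible change
    flip_bit(golden[2], 62),   # high exponent bit: value collapses toward 0
]

strict = count_mismatches(golden, observed)                   # exact comparison
relaxed = count_mismatches(golden, observed, rel_tol=0.005)   # 0.5% tolerance
print(strict, relaxed)
```

In this toy case the low-order mantissa flip is a mismatch under exact comparison but falls within the 0.5% tolerance, while the exponent flip is counted either way. This is the sense in which accepting imprecise results can lower the observed error rate.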
