Fault Modeling of Extreme Scale Applications Using Machine Learning

Faults are commonplace in large-scale systems, which experience transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, in which case a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements: the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and the important metrics. We use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
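The workflow described above (inject a multi-bit fault, record a fault signature, label the outcome, learn a predictor) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the application is a toy mean-of-array kernel rather than NWChem, LULESH, or SVM; the signature attributes (max_bit, index) are hypothetical; and a one-feature decision stump stands in for the supervised learning algorithms the paper evaluates.

```python
import random
import struct


def flip_bits(value, bit_positions):
    """Return the float64 `value` with the given bit positions (0..63) inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for b in bit_positions:
        raw ^= 1 << b
    (out,) = struct.unpack("<d", struct.pack("<Q", raw))
    return out


def toy_kernel(data):
    """Hypothetical stand-in for an application: the mean of an array."""
    return sum(data) / len(data)


def inject_and_label(data, index, bit_positions, tol=1e-6):
    """Inject one multi-bit fault and report whether the output is in error."""
    golden = toy_kernel(data)
    faulty = list(data)
    faulty[index] = flip_bits(faulty[index], bit_positions)
    result = toy_kernel(faulty)
    # NaN, infinity, or a relative deviation above `tol` all count as errors.
    return not (abs(result - golden) <= tol * abs(golden))


def make_dataset(n_samples, seed=0):
    """Run a small injection campaign; return (signature, is_error) pairs."""
    rng = random.Random(seed)
    data = [rng.uniform(1.0, 2.0) for _ in range(64)]
    samples = []
    for _ in range(n_samples):
        bits = rng.sample(range(64), 2)  # a double-bit fault
        index = rng.randrange(len(data))
        # Hypothetical fault signature: which element, and highest flipped bit.
        signature = {"max_bit": max(bits), "index": index}
        samples.append((signature, inject_and_label(data, index, bits)))
    return samples


def fit_stump(samples, feature):
    """Pick threshold t minimizing mistakes of: predict error iff feature >= t."""
    best_t, best_wrong = 0, len(samples) + 1
    for t in range(65):
        wrong = sum((sig[feature] >= t) != label for sig, label in samples)
        if wrong < best_wrong:
            best_t, best_wrong = t, wrong
    return best_t


def accuracy(samples, feature, t):
    """Fraction of samples on which the stump's prediction matches the label."""
    hits = sum((sig[feature] >= t) == label for sig, label in samples)
    return hits / len(samples)


if __name__ == "__main__":
    train = make_dataset(500, seed=1)
    held_out = make_dataset(200, seed=2)
    threshold = fit_stump(train, "max_bit")
    print("learned threshold on max_bit:", threshold)
    print("held-out accuracy:", accuracy(held_out, "max_bit", threshold))
```

In this toy setting the label depends almost entirely on the highest flipped bit, so a single threshold separates benign from harmful faults well. Real applications would likely need richer signatures and stronger learners, which is the design space the paper explores.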
