MOARD: Modeling Application Resilience to Transient Faults on Data Objects

Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. However, we lack a method and a tool to quantify application resilience to transient faults on data objects. The traditional method, random fault injection, cannot help, because of losing data semantics and insufficient information on how and where errors are tolerated. In this paper, we introduce a method and a tool (called "MOARD") to model and quantify application resilience to transient faults on data objects. Our method is based on systematically quantifying error masking events caused by application-inherent semantics and program constructs. We use MOARD to study how and why errors in data objects can be tolerated by the application. We demonstrate tangible benefits of using MOARD to direct a fault tolerance mechanism to protect data objects.

[1]  V. E. Henson,et al.  BoomerAMG: a parallel algebraic multigrid solver and preconditioner , 2002 .

[2]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Marc Snir,et al.  FlipIt: An LLVM Based Fault Injector for HPC , 2014, Euro-Par Workshops.

[4]  Alan Edelman,et al.  PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[5]  Sarita V. Adve,et al.  Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults , 2012, ASPLOS XVII.

[6]  David R. Kaeli,et al.  Eliminating microarchitectural dependency from Architectural Vulnerability , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[7]  Ravishankar K. Iyer,et al.  Application-based metrics for strategic placement of detectors , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[8]  Arijit Biswas,et al.  Computing architectural vulnerability factors for address-based structures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[9]  Johan Karlsson,et al.  One Bit is (Not) Enough: An Empirical Study of the Impact of Single and Multiple Bit-Flip Errors , 2017, 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[10]  Régis Leveugle,et al.  Statistical fault injection: Quantified error and confidence , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[11]  Bin Li,et al.  Versatile prediction and fast estimation of Architectural Vulnerability Factor from processor performance metrics , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[12]  Ganesh Gopalakrishnan,et al.  Towards Formal Approaches to System Resilience , 2013, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing.

[13]  Dong Li,et al.  Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[14]  Zizhong Chen,et al.  On-line soft error correction in matrix-matrix multiplication , 2013, J. Comput. Sci..

[15]  Xiaodong Li,et al.  Online Estimation of Architectural Vulnerability Factor for Soft Errors , 2008, 2008 International Symposium on Computer Architecture.

[16]  Dong Li,et al.  Quantitatively Modeling Application Resilience with the Data Vulnerability Factor , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[17]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[18]  Martin Schulz,et al.  Fault resilience of the algebraic multi-grid solver , 2012, ICS '12.

[19]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[20]  Martin Schulz,et al.  Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[21]  David M. Brooks,et al.  ISA-independent workload characterization and its implications for specialized architectures , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[22]  Sarita V. Adve,et al.  Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[23]  Abhinav Vishnu,et al.  Fault Modeling of Extreme Scale Applications Using Machine Learning , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[24]  Donald Yeung,et al.  Application-Level Correctness and its Impact on Fault Tolerance , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[25]  Padma Raghavan,et al.  Characterizing the impact of soft errors on iterative methods in scientific computing , 2011, ICS '11.

[26]  David H. Bailey,et al.  NAS parallel benchmark results , 1992, Proceedings Supercomputing '92.

[27]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[28]  Zizhong Chen,et al.  Correcting soft errors online in LU factorization , 2013, HPDC '13.

[29]  Thomas Hérault,et al.  Algorithm-based fault tolerance for dense matrix factorizations , 2012, PPoPP '12.

[30]  Zizhong Chen Algorithm-based recovery for iterative methods without checkpointing , 2011, HPDC '11.

[31]  Brandon Lucia,et al.  Alpaca: intermittent execution without checkpoints , 2017, Proc. ACM Program. Lang..

[32]  Henry Hoffmann,et al.  Patterns and statistical analysis for understanding reduced resource computing , 2010, OOPSLA.

[33]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[34]  Zizhong Chen,et al.  Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.

[35]  Harshitha Menon,et al.  DisCVar: discovering critical variables using algorithmic differentiation for transient faults , 2018, PPoPP.

[36]  Pradip Bose,et al.  Understanding Error Propagation in GPGPU Applications , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[37]  Sarita V. Adve,et al.  GangES: Gang error simulation for hardware resiliency evaluation , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).