Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience

Abstract. Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort required to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, and they require no changes to the software architecture. We also measure the overhead of adding GVR versioning and show that overheads are generally below 2%. We conclude that GVR's interfaces and implementation are flexible and portable, and that they create a gentle-slope path to tolerating growing error rates in future systems.
