Application-Specific Fault Tolerance via Data Access Characterization

Recent trends in semiconductor technology and supercomputer design point to an increasing probability of faults during an application's execution. Designing an application that is resilient to system failures requires careful evaluation of how well each candidate approach preserves key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct data access signatures of key application modules and use them to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, the characterization of the application modules, and the evaluation of fault tolerance techniques based on the information collected. The signatures capture application characteristics not traditionally revealed by performance tools. We believe these signatures can also be used in the design and evaluation of runtimes beyond fault tolerance.
