Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

BACKGROUND Large-scale biological jobs on high-performance computing systems require manual intervention if one or more of the computing cores on which they execute fail. This imposes not only a maintenance cost on the job, but also the time taken to reinstate the job and the risk of losing the data and the execution progress the job had made before it failed. Approaches that proactively detect computing core failures and relocate a failing core's job onto reliable cores are a significant step towards automating fault tolerance.

METHOD This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. Both are investigated for single-core failure scenarios that can occur during the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology at both the job and the core level. Experiments are carried out in the context of genome searching, a popular computational biology application.

RESULT The key conclusion is that the proposed approaches are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which fault tolerance is studied, centralised and decentralised checkpointing approaches add, on average, 90% to the actual time for executing the job. In the same experiment, the multi-agent approaches add only 10% to the overall execution time.
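As a rough illustration of the idea summarised above, the following toy sketch simulates a genome-style pattern search as a parallel reduction in which the agent carrying a sub-job moves to a spare core when its current core is flagged as likely to fail. It is not the paper's implementation: the names `Core`, `Agent`, `search`, `reduce_counts` and the `predicted_to_fail` flag are all hypothetical, failure prediction is reduced to a boolean, and everything runs in a single process rather than on a cluster.

```python
from dataclasses import dataclass


@dataclass
class Core:
    core_id: int
    predicted_to_fail: bool = False  # hypothetical health signal, not a real API


@dataclass
class Agent:
    """Hypothetical agent carrying one sub-job (a chunk of the sequence)."""
    chunk: list
    core: Core

    def migrate_if_needed(self, spare_cores):
        # Relocate to a spare core before the predicted failure occurs.
        if self.core.predicted_to_fail and spare_cores:
            self.core = spare_cores.pop()
            return True
        return False

    def run(self, pattern):
        # Sub-job: count occurrences of the pattern within this chunk.
        return sum(1 for item in self.chunk if item == pattern)


def reduce_counts(values):
    # Pairwise (tree-style) reduction of the partial results.
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # carry the unpaired value forward
        values = paired
    return values[0]


def search(data, pattern, n_workers, failing_core_id=None):
    cores = [Core(i) for i in range(n_workers)]
    spares = [Core(n_workers)]  # assumption: one spare core is available
    if failing_core_id is not None:
        cores[failing_core_id].predicted_to_fail = True

    size = -(-len(data) // n_workers)  # ceiling division
    agents = [Agent(data[i * size:(i + 1) * size], cores[i])
              for i in range(n_workers)]

    # Proactive step: agents on suspect cores move before computing.
    migrations = sum(agent.migrate_if_needed(spares) for agent in agents)
    partials = [agent.run(pattern) for agent in agents]
    return reduce_counts(partials), migrations


if __name__ == "__main__":
    total, migrations = search(list("ACGTACGTAAGT"), "A", 4, failing_core_id=2)
    print(total, migrations)
```

In this sketch the result of the search is the same whether or not a core is flagged, because the affected sub-job moves before any work is lost; the only extra cost is the migration itself, which is the intuition behind the lower overhead reported relative to checkpointing.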
