Symmetric Active/Active High Availability for High-Performance Computing System Services

This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system: when one fails, the system remains inaccessible and unmanageable until it is repaired. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes. The nodes run in virtual synchrony, using an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Results from a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.
