Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations

This paper summarizes our efforts over the last three to four years to provide symmetric active/active high availability for high-performance computing (HPC) system services. This work paves the way for high-level reliability, availability, and serviceability in extreme-scale HPC systems by focusing on the most critical components, the head and service nodes, and reinforcing them with appropriate high availability solutions. The paper presents our accomplishments in the form of concepts and their respective prototypes, discusses existing limitations, outlines possible future work, and describes the relevance of this research to other planned efforts.
