Total order broadcast for fault tolerant exascale systems

While designing a new fault-tolerant run-time for future exascale systems, we found that a total order broadcast would be necessary. That is, nodes of a supercomputer should be able to broadcast messages to other nodes even in the face of failures, and all nodes should see all messages in the same order. While this is a well-studied problem in distributed systems, few researchers have examined how to perform total order broadcasts at large scale to provide data availability. When we implemented a published total order broadcast algorithm, it scaled poorly at tens of nodes. In this paper we present a novel algorithm for total order broadcast that scales logarithmically in the number of processes and is not delayed by most process failures. Although we are motivated by the needs of our run-time, we believe this primitive is generally applicable. Total order broadcasts are often used in datacenter environments, and as HPC developers begin to address fault tolerance at the application level, we believe they will need similar primitives.
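The ordering property described above can be illustrated with a minimal sketch. It is not the algorithm presented in the paper: it assumes a single fixed sequencer that never fails, and the names Node, Sequencer, and simulate are hypothetical, chosen only for this illustration. Each message receives a global sequence number, the simulated network reorders arrivals independently at each node, and every node buffers messages until it can deliver them in sequence order.

```python
import random


class Node:
    """Hypothetical receiver that delivers strictly in sequence-number order."""

    def __init__(self, name):
        self.name = name
        self.next_seq = 0    # next sequence number this node may deliver
        self.pending = {}    # out-of-order arrivals, keyed by sequence number
        self.delivered = []  # messages in the order they were delivered

    def receive(self, seq, msg):
        self.pending[seq] = msg
        # Deliver every buffered message that is now contiguous.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


class Sequencer:
    """Assumed never to fail; a real protocol must also handle its failure."""

    def __init__(self):
        self.counter = 0

    def stamp(self, msg):
        seq = self.counter
        self.counter += 1
        return seq, msg


def simulate(messages, node_count=4, seed=0):
    rng = random.Random(seed)
    sequencer = Sequencer()
    stamped = [sequencer.stamp(m) for m in messages]
    nodes = [Node(f"n{i}") for i in range(node_count)]
    for node in nodes:
        arrivals = stamped[:]
        rng.shuffle(arrivals)  # the network may reorder messages per node
        for seq, msg in arrivals:
            node.receive(seq, msg)
    return [node.delivered for node in nodes]


orders = simulate(["a", "b", "c", "d", "e"])
# Total order: every node delivered the same sequence of messages.
assert all(order == orders[0] for order in orders)
```

A single sequencer handles every message and is a single point of failure, so this sketch shows only the ordering guarantee, not the scalability or failure behavior that the paper addresses.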
