Paxos Replicated State Machines as the Basis of a High-Performance Data Store

Conventional wisdom holds that Paxos is too expensive to use for high-volume, high-throughput, data-intensive applications. Consequently, fault-tolerant storage systems typically rely on special hardware, semantics weaker than sequential consistency, a limited update interface (such as append-only), primary-backup replication schemes that serialize all reads through the primary, clock synchronization for correctness, or some combination thereof. We demonstrate that a Paxos-based replicated state machine implementing a storage service can achieve performance close to the limits of the underlying hardware while tolerating arbitrary machine restarts, some permanent machine or disk failures and a limited set of Byzantine faults. We also compare it with two versions of primary-backup. The replicated state machine can serve as the data store for a file system or storage array. We present a novel algorithm for ensuring read consistency without logging, along with a sketch of a proof of its correctness.

[1]  Fred B. Schneider,et al.  Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[2]  Jon Howell,et al.  The SMART way to migrate replicated stateful services , 2006, EuroSys.

[3]  Chandramohan A. Thekkath,et al.  Petal: distributed virtual disks , 1996, ASPLOS VII.

[4]  Rajeev Nagar,et al.  Windows NT File System Internals , 1997 .

[5]  Leslie Lamport,et al.  The part-time parliament , 1998, TOCS.

[6]  B. M. Oki,et al.  VIEWSTAMPED REPLICATION FOR HIGHLY AVAILABLE DISTRIBUTED SYSTEMS , 1988 .

[7]  Mahadev Satyanarayanan,et al.  Scale and performance in a distributed file system , 1987, SOSP '87.

[8]  Mendel Rosenblum,et al.  The design and implementation of a log-structured file system , 1991, SOSP '91.

[9]  Robert Griesemer,et al.  Paxos made live: an engineering perspective , 2007, PODC '07.

[10]  Kenneth P. Birman,et al.  Reliable Distributed Systems: Technologies, Web Services, and Applications , 2005 .

[11]  Arif Merchant,et al.  FAB: building distributed enterprise disk arrays from commodity components , 2004, ASPLOS XI.

[12]  Yale N. Patt,et al.  Scheduling algorithms for modern disk drives , 1994, SIGMETRICS 1994.

[13]  R. M. Tomasulo,et al.  An efficient algorithm for exploiting multiple arithmetic units , 1995 .

[14]  A. M. Lister,et al.  Fundamentals of Operating Systems , 1984, Springer New York.

[15]  Leslie Lamport,et al.  The Byzantine Generals Problem , 1982, TOPL.

[16]  Leslie Lamport,et al.  Paxos Made Simple , 2001 .

[17]  Antony I. T. Rowstron,et al.  Everest: Scaling Down Peak Loads Through I/O Off-Loading , 2008, OSDI.

[18]  Andreas Reuter,et al.  Transaction Processing: Concepts and Techniques , 1992 .

[19]  Miguel Oom Temudo de Castro,et al.  Practical Byzantine fault tolerance , 1999, OSDI '99.

[20]  Butler W. Lampson,et al.  The ABCD's of Paxos , 2001, PODC '01.

[21]  Jeanna Neefe Matthews,et al.  Serverless network file systems , 1996, TOCS.

[22]  Andrea C. Arpaci-Dusseau,et al.  An analysis of data corruption in the storage stack , 2008, TOS.

[23]  Liuba Shrira,et al.  HQ replication: a hybrid quorum protocol for byzantine fault tolerance , 2006, OSDI '06.

[24]  Ben Y. Zhao,et al.  OceanStore: an architecture for global-scale persistent storage , 2000, SIGP.

[25]  John H. Hartman,et al.  The Zebra striped network file system , 1995, TOCS.

[26]  John R. Douceur,et al.  Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer PCs , 2011, EuroSys '11.

[27]  Marvin Theimer,et al.  The Bayou Architecture: Support for Data Sharing Among Mobile Users , 1994, 1994 First Workshop on Mobile Computing Systems and Applications.

[28]  Marc Najork,et al.  Boxwood: Abstractions as the Foundation for Storage Infrastructure , 2004, OSDI.

[29]  Rajeev Nagar,et al.  Windows NT file system internals - a developer's guide: building NT file system drivers , 1997 .

[30]  Arun Venkataramani,et al.  Separating agreement from execution for byzantine fault tolerant services , 2003, SOSP '03.

[31]  Shivakumar Venkataraman,et al.  The TickerTAIP parallel RAID architecture , 1993, ISCA '93.

[32]  Ramakrishna Kotla,et al.  Zyzzyva: speculative byzantine fault tolerance , 2007, TOCS.

[33]  Michael K. Reiter,et al.  Fault-scalable Byzantine fault-tolerant services , 2005, SOSP '05.

[34]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[35]  Brett D. Fleisch,et al.  The Chubby lock service for loosely-coupled distributed systems , 2006, OSDI '06.

[36]  GhemawatSanjay,et al.  The Google file system , 2003 .

[37]  Kalen Delaney,et al.  Microsoft SQL Server 2008 Internals , 2009 .

[38]  Nikolai Joukov,et al.  A nine year study of file system and storage benchmarking , 2008, TOS.

[39]  Seif Haridi,et al.  Distributed Algorithms , 1992, Lecture Notes in Computer Science.