Paxos Replicated State Machines as the Basis of a High-Performance Data Store

Conventional wisdom holds that Paxos is too expensive to use for high-volume, high-throughput, data-intensive applications. Consequently, fault-tolerant storage systems typically rely on special hardware, semantics weaker than sequential consistency, a limited update interface (such as append-only), primary-backup replication schemes that serialize all reads through the primary, clock synchronization for correctness, or some combination thereof. We demonstrate that a Paxos-based replicated state machine implementing a storage service can achieve performance close to the limits of the underlying hardware while tolerating arbitrary machine restarts, some permanent machine or disk failures and a limited set of Byzantine faults. We also compare it with two versions of primary-backup. The replicated state machine can serve as the data store for a file system or storage array. We present a novel algorithm for ensuring read consistency without logging, along with a sketch of a proof of its correctness.

Peng Li | William J. Bolosky | Dexter P. Bradshaw | Randolph B. Haagens | Norbert P. Kusters

[1] Fred B. Schneider,et al. Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[2] Jon Howell,et al. The SMART way to migrate replicated stateful services , 2006, EuroSys.

[3] Chandramohan A. Thekkath,et al. Petal: distributed virtual disks , 1996, ASPLOS VII.

[4] Rajeev Nagar,et al. Windows NT File System Internals , 1997 .

[5] Leslie Lamport,et al. The part-time parliament , 1998, TOCS.

[6] B. M. Oki,et al. Viewstamped Replication for Highly Available Distributed Systems , 1988 .

[7] Mahadev Satyanarayanan,et al. Scale and performance in a distributed file system , 1987, SOSP '87.

[8] Mendel Rosenblum,et al. The design and implementation of a log-structured file system , 1991, SOSP '91.

[9] Robert Griesemer,et al. Paxos made live: an engineering perspective , 2007, PODC '07.

[10] Kenneth P. Birman,et al. Reliable Distributed Systems: Technologies, Web Services, and Applications , 2005 .

[11] Arif Merchant,et al. FAB: building distributed enterprise disk arrays from commodity components , 2004, ASPLOS XI.

[12] Yale N. Patt,et al. Scheduling algorithms for modern disk drives , 1994, SIGMETRICS 1994.

[13] R. M. Tomasulo,et al. An efficient algorithm for exploiting multiple arithmetic units , 1995 .

[14] A. M. Lister,et al. Fundamentals of Operating Systems , 1984, Springer New York.

[15] Leslie Lamport,et al. The Byzantine Generals Problem , 1982, TOPLAS.

[16] Leslie Lamport,et al. Paxos Made Simple , 2001 .

[17] Antony I. T. Rowstron,et al. Everest: Scaling Down Peak Loads Through I/O Off-Loading , 2008, OSDI.

[18] Andreas Reuter,et al. Transaction Processing: Concepts and Techniques , 1992 .

[19] Miguel Oom Temudo de Castro,et al. Practical Byzantine fault tolerance , 1999, OSDI '99.

[20] Butler W. Lampson,et al. The ABCD's of Paxos , 2001, PODC '01.

[21] Jeanna Neefe Matthews,et al. Serverless network file systems , 1996, TOCS.

[22] Andrea C. Arpaci-Dusseau,et al. An analysis of data corruption in the storage stack , 2008, TOS.

[23] Liuba Shrira,et al. HQ replication: a hybrid quorum protocol for byzantine fault tolerance , 2006, OSDI '06.

[24] Ben Y. Zhao,et al. OceanStore: an architecture for global-scale persistent storage , 2000, SIGP.

[25] John H. Hartman,et al. The Zebra striped network file system , 1995, TOCS.

[26] John R. Douceur,et al. Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs , 2011, EuroSys '11.

[27] Marvin Theimer,et al. The Bayou Architecture: Support for Data Sharing Among Mobile Users , 1994, 1994 First Workshop on Mobile Computing Systems and Applications.

[28] Marc Najork,et al. Boxwood: Abstractions as the Foundation for Storage Infrastructure , 2004, OSDI.

[29] Rajeev Nagar,et al. Windows NT file system internals - a developer's guide: building NT file system drivers , 1997 .

[30] Arun Venkataramani,et al. Separating agreement from execution for byzantine fault tolerant services , 2003, SOSP '03.

[31] Shivakumar Venkataraman,et al. The TickerTAIP parallel RAID architecture , 1993, ISCA '93.

[32] Ramakrishna Kotla,et al. Zyzzyva: speculative byzantine fault tolerance , 2007, TOCS.

[33] Michael K. Reiter,et al. Fault-scalable Byzantine fault-tolerant services , 2005, SOSP '05.

[34] Eduardo Pinheiro,et al. DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[35] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems , 2006, OSDI '06.

[36] Sanjay Ghemawat,et al. The Google file system , 2003 .

[37] Kalen Delaney,et al. Microsoft SQL Server 2008 Internals , 2009 .

[38] Nikolai Joukov,et al. A nine year study of file system and storage benchmarking , 2008, TOS.

[39] Seif Haridi,et al. Distributed Algorithms , 1992, Lecture Notes in Computer Science.