Fast Log Replication in Highly Available Data Store

Modern large-scale data stores widely adopt consensus protocols to achieve high availability and throughput. The recently proposed Raft algorithm offers better understandability and has been implemented in a large number of open-source projects. In consensus algorithms such as Raft, log replication is a common and frequently executed operation that has a significant impact on system performance. In particular, because commit latency is bounded by the slowest follower among the majority that responds to the leader, it is important to design a fast scheme for followers to process replicated logs. Based on an analysis of how a follower handles received log entries in the Raft algorithm, we identify the main factors that influence the interval between a follower receiving a log entry and acknowledging its receipt to the leader. Guided by these factors, we propose an effective log replication scheme, referred to as Raft with Fast Followers (FRaft), that optimizes the process of flushing logs to disk and replaying them. Finally, we compare the performance of Raft and FRaft using the YCSB benchmark and the Sysbench test tool; the experimental results demonstrate that FRaft achieves lower latency and higher throughput than Raft with only straightforward pipelining and batching optimizations for log replication.
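To make the follower-side critical path concrete, below is a minimal Go sketch of the idea the abstract describes: a follower acknowledges the leader as soon as received entries are durable, while replaying committed entries to the state machine happens on a separate goroutine, off the acknowledgement path. This is an illustrative sketch under assumed interfaces (`Storage`, `StateMachine`, `Entry`, and the `HandleAppendEntries` signature are all hypothetical), not the paper's actual FRaft implementation.

```go
// Package follower sketches a Raft follower whose log replay is decoupled
// from the acknowledgement critical path. All types and names here are
// hypothetical, chosen for illustration only.
package follower

// Entry is one replicated log record.
type Entry struct {
	Term  uint64
	Index uint64
	Data  []byte
}

// Storage is assumed to batch fsync calls internally (group commit);
// entries are durable once Append returns.
type Storage interface {
	Append(entries []Entry) error
}

// StateMachine applies committed entries.
type StateMachine interface {
	Apply(e Entry)
}

// Follower holds durable storage, the state machine, and a queue of
// committed entries awaiting replay.
type Follower struct {
	store    Storage
	sm       StateMachine
	commitCh chan []Entry
}

// NewFollower starts the background apply loop so replay runs
// concurrently with acknowledgement handling.
func NewFollower(store Storage, sm StateMachine) *Follower {
	f := &Follower{store: store, sm: sm, commitCh: make(chan []Entry, 1024)}
	go f.applyLoop()
	return f
}

// HandleAppendEntries is the follower side of log replication: persist the
// entries, queue the committed prefix for replay, and ack immediately.
// Replay is deliberately not on the ack path.
func (f *Follower) HandleAppendEntries(entries []Entry, leaderCommit uint64) bool {
	if err := f.store.Append(entries); err != nil {
		return false // never ack entries that are not durable
	}
	// Hand the committed prefix to the apply loop; the buffered channel
	// keeps replay from delaying the acknowledgement in the common case.
	var committed []Entry
	for _, e := range entries {
		if e.Index <= leaderCommit {
			committed = append(committed, e)
		}
	}
	if len(committed) > 0 {
		f.commitCh <- committed
	}
	return true // the leader may now count this follower toward the majority
}

// applyLoop replays committed entries to the state machine in order.
func (f *Follower) applyLoop() {
	for batch := range f.commitCh {
		for _, e := range batch {
			f.sm.Apply(e)
		}
	}
}
```

The design point this sketch illustrates is the one the abstract motivates: since commit latency is bounded by the slowest responder in the majority, only durability belongs on the ack path; replaying entries can proceed asynchronously without affecting safety.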
