Exploiting Nil-Externality for Fast Replicated Storage

Do some storage interfaces enable higher performance than others? Can one identify and exploit such interfaces to realize high performance in storage systems? This paper answers these questions in the affirmative by identifying nil-externality, a property of storage interfaces. A nil-externalizing (nilext) interface may modify state within a storage system but does not immediately externalize its effects or system state to the outside world. As a result, a storage system can apply nilext operations lazily, improving performance. In this paper, we take advantage of nilext interfaces to build high-performance replicated storage. We implement Skyros, a nilext-aware replication protocol that offers high performance by deferring the ordering and execution of operations until their effects are externalized. We show that exploiting nil-externality offers significant benefit: for many workloads, Skyros provides higher performance than standard consensus-based replication. For example, Skyros offers 3x lower latency while providing the same high throughput offered by throughput-optimized Paxos.
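The nil-externality idea can be illustrated with a minimal single-node sketch (this is an illustration of the property, not the Skyros protocol itself; the class and method names here are hypothetical). A put that returns nothing about internal state is nilext and can be buffered; a get externalizes state, so pending writes must be ordered and applied before it is answered:

```python
class NilextStore:
    """Illustrative key-value store that defers nilext writes until a
    read externalizes their effects (a sketch, not Skyros itself)."""

    def __init__(self):
        self._applied = {}   # state with all writes applied, in order
        self._pending = []   # nilext writes accepted but not yet applied

    def put(self, key, value):
        # put() is nilext: it modifies state but reveals nothing about
        # internal state to the caller, so it can be applied lazily.
        self._pending.append((key, value))

    def get(self, key):
        # get() externalizes state: all pending writes must be ordered
        # and applied before answering.
        self._sync()
        return self._applied.get(key)

    def _sync(self):
        # Apply buffered writes in acceptance order.
        for key, value in self._pending:
            self._applied[key] = value
        self._pending.clear()
```

In the replicated setting, this deferral is what lets nilext operations complete without first running the ordering machinery of consensus on the critical path; ordering is paid for only when a non-nilext operation forces externalization.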
