Multi-Version Coding—An Information-Theoretic Perspective of Consistent Distributed Storage

In applications of distributed storage systems to distributed computing and to the implementation of key-value stores, the following property, usually referred to as consistency in distributed computing, is an important requirement: as the stored data changes, the latest version of the data must be accessible to a client that connects to the storage system. Motivated by technological trends in which key-value stores are increasingly implemented in high-speed memory, this paper introduces an information-theoretic formulation called multi-version coding in order to understand and minimize the memory overhead of consistent distributed storage. Multi-version coding is characterized by $\nu$ totally ordered versions of a message and a storage system with $n$ servers. Each server receives values corresponding to an arbitrary subset of the $\nu$ versions and encodes them. For any subset of $c$ servers in the storage system, the value corresponding to the latest version common to those $c$ servers, or to a later version in the total ordering, must be decodable. The paper provides an achievable multi-version code construction based on linear coding, together with a converse result showing that the construction is asymptotically tight when $\nu \mid (c-1)$. An implication of the converse is that ensuring consistency in distributed storage systems carries an inevitable price in storage cost.
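The decoding requirement stated above can be illustrated with a minimal Python sketch. It only checks which versions a decoder would be allowed to return for a given subset of $c$ servers; it does not implement any code construction from the paper. The function names, the representation of server states as sets of version numbers, and the treatment of subsets with no common version (assumed here to impose no requirement) are all illustrative assumptions.

```python
def latest_common_version(states, subset):
    """Return the largest version received by every server in `subset`.

    `states[i]` is the set of version numbers server i has received
    (hypothetical representation). Returns None if no version is common.
    """
    common = set.intersection(*(set(states[i]) for i in subset))
    return max(common) if common else None


def acceptable_versions(states, subset, nu):
    """List the versions that satisfy the decoding requirement for `subset`.

    A decoder meets the requirement if it returns the latest common
    version or any later one among the nu totally ordered versions.
    Assumption: with no common version, nothing is required (None).
    """
    latest = latest_common_version(states, subset)
    if latest is None:
        return None
    return list(range(latest, nu + 1))


# Example with nu = 3 versions, n = 4 servers, and c = 2.
states = [{1, 2}, {1}, {1, 2, 3}, {2, 3}]
nu = 3

print(acceptable_versions(states, (0, 1), nu))  # common version is 1
print(acceptable_versions(states, (0, 3), nu))  # common version is 2
print(acceptable_versions(states, (2, 3), nu))  # common versions are 2 and 3
print(acceptable_versions(states, (1, 3), nu))  # no common version
```

In this example, servers 0 and 1 share only version 1, so decoding any of versions 1, 2, or 3 would satisfy the requirement, whereas servers 2 and 3 share version 3, so only version 3 is acceptable.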
