A highly-available move operation for replicated trees

Replicated tree data structures are a fundamental building block of distributed filesystems, such as Google Drive and Dropbox, and collaborative applications with a JSON or XML data model. These systems need to support a move operation that allows a subtree to be moved to a new location within the tree. However, such a move operation is difficult to implement correctly if different replicas can concurrently perform arbitrary move operations, and we demonstrate bugs in Google Drive and Dropbox that arise with concurrent moves. In this article we present a CRDT algorithm that handles arbitrary concurrent modifications on trees, while ensuring that the tree structure remains valid (in particular, no cycles are introduced), and guaranteeing that all replicas converge towards the same consistent state. Our algorithm requires no synchronous coordination between replicas, making it highly available in the face of network partitions. We formally prove the correctness of our algorithm using the Isabelle/HOL proof assistant, and evaluate the performance of our formally verified implementation in a geo-replicated setting.
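
To illustrate why concurrent moves are hard, the following minimal sketch (not the algorithm of this article) models a tree as a child-to-parent map and rejects any move that would create a cycle. The names `is_ancestor` and `apply_move`, and the two-replica scenario, are assumptions made for the example. It shows that a local cycle check keeps each replica's tree valid but does not by itself make replicas converge.

```python
def is_ancestor(parent_of, anc, node):
    """Return True if `anc` is a strict ancestor of `node` in the child -> parent map."""
    cur = parent_of.get(node)
    while cur is not None:
        if cur == anc:
            return True
        cur = parent_of.get(cur)
    return False


def apply_move(parent_of, child, new_parent):
    """Move `child` under `new_parent`; skip the move if it would create a cycle."""
    if child == new_parent or is_ancestor(parent_of, child, new_parent):
        return False  # unsafe: `child` would become its own ancestor
    parent_of[child] = new_parent
    return True


# Both replicas start from the same tree: root has children a and b.
replica1 = {"a": "root", "b": "root"}
replica2 = dict(replica1)

# Concurrently, replica 1 moves a under b while replica 2 moves b under a.
apply_move(replica1, "a", "b")
apply_move(replica2, "b", "a")

# Each replica then receives the other's move. Applying it blindly would
# create the cycle a -> b -> a, so the check rejects it on both sides.
r1_accepts_remote = apply_move(replica1, "b", "a")
r2_accepts_remote = apply_move(replica2, "a", "b")

# Both trees are still valid, but the replicas now disagree.
replicas_converged = replica1 == replica2
```

In this sketch both remote moves are rejected, so each replica keeps only its own move and the two states differ. Convergence additionally requires every replica to decide in the same way which of the conflicting moves takes effect, for example by agreeing on a deterministic order of operations. The sketch deliberately leaves that part out.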
