Coordination Avoidance in Distributed Databases

The rise of Internet-scale geo-replicated services has led to upheaval in the design of modern data management systems. Given the availability, latency, and throughput penalties associated with classic mechanisms such as serializable transactions, a broad class of systems (e.g., "NoSQL") has sought weaker alternatives that reduce the use of expensive coordination during system operation, often at the cost of application integrity. When can we safely forgo the cost of this expensive coordination, and when must we pay the price?

In this thesis, we investigate the potential for coordination avoidance (the use of as little coordination as possible while ensuring application integrity) in several modern data-intensive domains. We demonstrate how to leverage the semantic requirements of applications in data serving, transaction processing, and web services to enable more efficient distributed algorithms and system designs. The resulting prototype systems demonstrate regular order-of-magnitude speedups compared to their traditional, coordinated counterparts on a variety of tasks, including referential integrity and index maintenance, transaction execution under common isolation models, and database constraint enforcement. A range of open source applications and systems exhibit similar results.
