CockroachDB: The Resilient Geo-Distributed SQL Database

We live in an increasingly interconnected world, with many organizations operating across countries or even continents. To serve their global user base, organizations are replacing their legacy DBMSs with cloud-based systems capable of scaling OLTP workloads to millions of users. CockroachDB is a scalable SQL DBMS that was built from the ground up to support these global OLTP workloads while maintaining high availability and strong consistency. Just like its namesake, CockroachDB is resilient to disasters through replication and automatic recovery mechanisms. This paper presents the design of CockroachDB and its novel transaction model that supports consistent geo-distributed transactions on commodity hardware. We describe how CockroachDB replicates and distributes data to achieve fault tolerance and high performance, as well as how its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect. Finally, we present a comprehensive performance evaluation and share a couple of case studies of CockroachDB users. We conclude by describing lessons learned while building CockroachDB over the last five years.
