The End of Slow Networks: It's Time for a Redesign

The next generation of high-performance networks with remote direct memory access (RDMA) capabilities requires a fundamental rethinking of the design of distributed in-memory DBMSs. These systems are commonly built under the assumption that the network is the primary bottleneck and should be avoided at all costs, but this assumption no longer holds. For instance, with InfiniBand FDR 4×, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel. Moreover, RDMA transfer latencies continue to improve rapidly. In this paper, we first argue that traditional distributed DBMS architectures cannot take full advantage of high-performance networks, and we suggest a new architecture to address this problem. We then discuss initial results from a prototype implementation of the proposed architecture for OLTP and OLAP workloads, which show remarkable performance improvements over existing designs.
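The "same ballpark" comparison can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it uses the nominal FDR signaling rate (14.0625 Gbit/s per lane with 64b/66b encoding) and assumes a DDR3-1600 memory channel with a 64-bit bus as the point of comparison. The choice of DDR3-1600 and the function names are assumptions made here, not taken from the text, and both figures are theoretical peaks rather than measured throughput.

```python
# Rough peak-bandwidth comparison: InfiniBand FDR 4x vs. one DDR3 channel.
# Assumption: DDR3-1600 is used as the reference memory channel.

FDR_LANE_GBIT_S = 14.0625   # nominal signaling rate per FDR lane (Gbit/s)
ENCODING_EFFICIENCY = 64 / 66  # FDR uses 64b/66b line encoding
LANES = 4                   # "4x" link width


def infiniband_fdr_4x_gb_per_s() -> float:
    """Theoretical data bandwidth of one FDR 4x port in GB/s."""
    return FDR_LANE_GBIT_S * ENCODING_EFFICIENCY * LANES / 8


def ddr3_channel_gb_per_s(mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth of one DDR3 channel in GB/s."""
    return mega_transfers_per_s * bus_bytes / 1000


network = infiniband_fdr_4x_gb_per_s()
memory = ddr3_channel_gb_per_s(1600)

print(f"InfiniBand FDR 4x port: {network:.2f} GB/s")
print(f"DDR3-1600 channel:      {memory:.2f} GB/s")
print(f"network / memory:       {network / memory:.2f}")
```

Under these assumptions a single port reaches roughly half the peak bandwidth of one memory channel, which is within the same order of magnitude; a slower channel such as DDR3-800 (6.4 GB/s peak) would be roughly on par.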
