Scaling concurrent log-structured data stores

Log-structured data stores (LSM-DSs) are widely accepted as the state-of-the-art implementation of key-value stores. They replace random disk writes with sequential I/O by accumulating large batches of updates in an in-memory data structure and merging it with the on-disk store in the background. While LSM-DS implementations have proved highly successful at masking the I/O bottleneck, scaling them up on multicore CPUs remains a challenge. Doing so is nontrivial because of their often rich APIs and the need to coordinate in-memory access with background I/O. We present cLSM, an algorithm for scalable concurrency in LSM-DS, which exploits multiprocessor-friendly data structures and non-blocking synchronization. cLSM supports a rich API, including consistent snapshot scans and general non-blocking read-modify-write operations. We implement cLSM based on the popular LevelDB key-value store and evaluate it using intensive synthetic workloads as well as workloads from production web-serving applications. Our algorithm outperforms state-of-the-art LSM-DS implementations, improving throughput by 1.5x to 2.5x. Moreover, cLSM demonstrates superior scalability with the number of cores, successfully exploiting twice as many cores as the competition.
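The basic log-structured write path described above (buffer updates in memory, then write them out as a sorted batch and merge in the background) can be illustrated with a minimal single-threaded sketch. This is not the cLSM algorithm and does not reflect LevelDB's implementation: the class name `TinyLSM`, the `memtable_limit` parameter, and the use of in-memory lists as a stand-in for on-disk files are all assumptions made for illustration, and the sketch omits concurrency, snapshots, and deletions entirely.

```python
import bisect


class TinyLSM:
    """Toy log-structured store (illustrative only, single-threaded)."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # in-memory buffer of recent updates
        self.runs = []                        # flushed (keys, values) runs, oldest first;
                                              # a stand-in for sorted on-disk files

    def put(self, key, value):
        # Writes touch only the in-memory buffer.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Emit the whole batch in key order, as one sequential write would.
        items = sorted(self.memtable.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.runs.append((keys, values))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the buffer, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.runs):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None

    def compact(self):
        # Merge all runs into one; later (newer) runs overwrite older entries.
        merged = {}
        for keys, values in self.runs:
            merged.update(zip(keys, values))
        if merged:
            items = sorted(merged.items())
            self.runs = [([k for k, _ in items], [v for _, v in items])]
        else:
            self.runs = []


store = TinyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)   # second put reaches the limit and triggers a flush
store.put("a", 3)   # newer value stays in the memtable and shadows the flushed one
print(store.get("a"), store.get("b"), store.get("z"))
```

In a real store the flush and merge run in the background while reads and writes continue, and coordinating that overlap across many cores is the problem the abstract refers to.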
