Pipelined Compaction for the LSM-Tree

Write-optimized data structures such as the Log-Structured Merge-tree (LSM-tree) and its variants are widely used in key-value storage systems such as Bigtable and Cassandra. Because writes are deferred and batched, LSM-tree-based storage systems rely on background compactions to merge key-value entries and keep them sorted for future queries and scans. Background compactions therefore play a key role in the performance of LSM-tree-based storage systems. Existing studies of background compaction focus on decreasing the compaction frequency, reducing I/O, or confining compactions to hot key ranges; they pay little attention to the computation time spent in compactions. However, this computation time is no longer negligible: it can exceed 60% of the total compaction time in storage systems built on flash-based SSDs. An alternative way to speed up compaction is therefore to better exploit the parallelism of the underlying hardware, including CPUs and I/O devices. In this paper, we analyze the compaction procedure, identify its performance bottleneck, and propose the Pipelined Compaction Procedure (PCP) to better utilize the parallelism of CPUs and I/O devices. Theoretical analysis shows that PCP improves the compaction bandwidth. Furthermore, we implement PCP in a real system and conduct extensive experiments. The experimental results show that the pipelined compaction procedure increases the compaction bandwidth and storage system throughput by 77% and 62%, respectively.
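
To illustrate the general idea of overlapping the CPU-bound and I/O-bound phases of a compaction, the sketch below splits the work into read, merge, and write stages that run concurrently on block-sized units. This is a minimal illustration only: the `Block` type, the stage functions, and the use of Go channels are assumptions made for the example, not the paper's actual implementation, which pipelines real SSTable reads, merges, and writes.

```go
// Minimal sketch of a three-stage compaction pipeline, assuming compaction
// work can be split into fixed-size block units. All names here (Block,
// readStage, mergeStage, writeStage) are hypothetical.
package main

import (
	"fmt"
	"sort"
)

// Block is a hypothetical unit of compaction work: a batch of keys
// read from the inputs of a compaction.
type Block struct {
	Keys []string
}

// readStage models the I/O-bound read phase: it emits input blocks.
func readStage(inputs []Block) <-chan Block {
	out := make(chan Block)
	go func() {
		defer close(out)
		for _, b := range inputs {
			out <- b // in a real system this would be a disk/SSD read
		}
	}()
	return out
}

// mergeStage models the CPU-bound phase: merge and sort entries per block.
func mergeStage(in <-chan Block) <-chan Block {
	out := make(chan Block)
	go func() {
		defer close(out)
		for b := range in {
			sort.Strings(b.Keys) // stand-in for merging and re-encoding work
			out <- b
		}
	}()
	return out
}

// writeStage models the I/O-bound write phase: persist merged blocks.
func writeStage(in <-chan Block) {
	for b := range in {
		fmt.Println("write block:", b.Keys) // stand-in for an SSD write
	}
}

func main() {
	inputs := []Block{
		{Keys: []string{"b", "a"}},
		{Keys: []string{"d", "c"}},
	}
	// The three stages run concurrently, so reads, merges, and writes of
	// different blocks overlap instead of executing strictly in sequence.
	writeStage(mergeStage(readStage(inputs)))
}
```

The point of the staging is simply that while one block is being merged on the CPU, the next block can already be read and the previous one written, which is how a pipelined compaction keeps both the CPU and the I/O devices busy.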
