Paper Information - An Efficient Bulk Loading Approach of Secondary Index in Distributed Log-Structured Data Stores

Improving the read performance of the Log-Structured Merge (LSM) tree has attracted much attention recently, and constructing a secondary index for LSM data stores is a popular solution. Bulk loading of the secondary index is inevitable when a new application is developed on an existing LSM data store. However, to the best of our knowledge, there are few studies on bulk loading of secondary indexes in distributed LSM-trees. In this paper, we study how to improve the performance of bulk loading of secondary indexes in distributed LSM-tree data stores, and we propose an efficient bulk loading approach for secondary indexes in log-structured data stores. First, we design a secondary index structure based on the distributed LSM-tree to guarantee the scalability and consistency of the secondary index. Second, we propose an efficient framework to handle bulk loading of the secondary index in a distributed environment, which provides good load balancing for query processing by using an equal-depth histogram to capture the data distribution. Theoretical analysis and experimental results on a standard benchmark illustrate the efficacy of the proposed methods in a distributed environment.
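The abstract does not describe how the equal-depth histogram is built, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes integer index keys and a sample of them that fits in memory; the function names `equal_depth_boundaries` and `assign_partition` are hypothetical. The idea is to choose split points so that each key range holds roughly the same number of sampled keys, then route each index entry to its range.

```python
from bisect import bisect_right
from typing import List, Sequence


def equal_depth_boundaries(sample: Sequence[int], num_partitions: int) -> List[int]:
    """Return num_partitions - 1 split points over a key sample.

    Each resulting range holds roughly the same number of sampled keys
    (an equal-depth, i.e. equal-frequency, histogram).
    Assumption: the sample fits in memory and represents the full key set.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not sample:
        raise ValueError("sample must not be empty")
    ordered = sorted(sample)
    n = len(ordered)
    # The i-th split point is the key at the i/num_partitions quantile.
    return [ordered[(i * n) // num_partitions] for i in range(1, num_partitions)]


def assign_partition(key: int, boundaries: Sequence[int]) -> int:
    """Map a key to a partition number in [0, len(boundaries)].

    Keys equal to a split point go to the partition on its right.
    """
    return bisect_right(boundaries, key)
```

With a skewed sample such as the squares of 0 to 99 and four partitions, the split points fall at 625, 2500, and 5625, and each partition receives 25 keys even though the key ranges differ widely in width. Heavily duplicated keys would make the partitions less even, which this sketch does not address.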

Aoying Zhou | Zhao Zhang | Weining Qian | Peng Cai | Yanchao Zhu

[1] Patrick E. O'Neil, et al. The log-structured merge-tree (LSM-tree), 1996, Acta Informatica.

[2] Liana L. Fong, et al. Diff-Index: Differentiated Index in Distributed Log-Structured Data Stores, 2014, EDBT.

[3] Wilson C. Hsieh, et al. Bigtable: A Distributed Storage System for Structured Data, 2006, TOCS.

[4] Chen Li, et al. AsterixDB: A Scalable, Open Source BDMS, 2014, Proc. VLDB Endow.

[5] Zhiwei Xu, et al. CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries, 2010, NPC.

[6] E. Brewer, et al. CAP twelve years later: How the "rules" have changed, 2012, Computer.

[7] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.