Efficient Top-k Retrieval on Massive Data

In many applications, the top-k query is an important operation that returns a set of interesting points from a potentially huge data space. This paper shows through analysis that existing algorithms cannot process top-k queries on massive data efficiently. It proposes T2S, a novel table-scan-based algorithm for efficiently computing top-k results on massive data. T2S first constructs a presorted table whose tuples are arranged in the order in which a round-robin retrieval on the sorted lists would encounter them. While scanning, T2S maintains only a fixed number of tuples to compute the results. The paper presents an early-termination check for T2S, along with an analysis of the scan depth. Selective retrieval is devised to skip tuples in the presorted table that cannot be top-k results, and the theoretical analysis proves that it significantly reduces the number of tuples retrieved. Construction, incremental-update, and batch-processing methods for the underlying data structures are also proposed. Extensive experiments on synthetic and real-life data sets show that T2S has a significant advantage over the existing algorithms.
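The presorted table and the early-termination check can be illustrated with a minimal sketch. It is not the paper's T2S implementation. It assumes that larger attribute values are better, that the scoring function is a plain sum of the attributes, and that a threshold-style stopping rule (in the spirit of threshold algorithms over sorted lists) is used. The function names `build_presorted_table` and `top_k_scan` are hypothetical, and selective retrieval is not modeled.

```python
import heapq

def build_presorted_table(rows):
    """Arrange rows in the order a round-robin pass over the per-attribute
    sorted lists would first encounter them (assumed layout, for illustration).
    Returns the table as (depth, row) pairs and one threshold per depth."""
    m = len(rows[0])
    # One sorted list per attribute: row indices in descending attribute order.
    sorted_lists = [
        sorted(range(len(rows)), key=lambda r, i=i: -rows[r][i])
        for i in range(m)
    ]
    seen = set()
    table = []
    thresholds = []
    for depth in range(len(rows)):
        for i in range(m):
            rid = sorted_lists[i][depth]
            if rid not in seen:
                seen.add(rid)
                table.append((depth, rows[rid]))
        # Upper bound on the score of any row first seen deeper than `depth`.
        thresholds.append(
            sum(rows[sorted_lists[i][depth]][i] for i in range(m))
        )
    return table, thresholds

def top_k_scan(table, thresholds, k):
    """Scan the table sequentially, keeping only k candidates in a min-heap.
    Stops once the k-th best score reaches the threshold of the current depth.
    Returns the top-k (score, row) pairs and the number of rows scanned."""
    heap = []
    pos = 0
    while pos < len(table):
        depth = table[pos][0]
        # Consume every row that was first seen at this depth.
        while pos < len(table) and table[pos][0] == depth:
            row = table[pos][1]
            score = sum(row)
            if len(heap) < k:
                heapq.heappush(heap, (score, row))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, row))
            pos += 1
        # No unscanned row can score above thresholds[depth].
        if len(heap) == k and heap[0][0] >= thresholds[depth]:
            break
    return sorted(heap, reverse=True), pos

rows = [(9, 1), (8, 8), (1, 9), (2, 2), (3, 1), (1, 3)]
table, thresholds = build_presorted_table(rows)
top, scanned = top_k_scan(table, thresholds, k=1)
print(top, scanned)
```

On this toy input the scan stops after three of the six rows, because the best score seen so far already matches the bound on every unscanned row. The heap holds at most k entries at any time, which mirrors the abstract's point that only a fixed number of tuples is maintained during the scan.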
