Progressive and selective merge: computing top-k with ad-hoc ranking functions

The family of threshold algorithms (i.e., TA) has been widely studied for efficiently computing top-k queries. TA uses a sort-merge framework that assumes that data lists are pre-sorted and that the ranking functions are monotone. However, in many database applications, attribute values are indexed by tree-structured indices (e.g., B-tree, R-tree), and the ranking functions are not necessarily monotone. To answer top-k queries with ad-hoc ranking functions, this paper studies an index-merge paradigm that performs progressive search over the space of joint states composed of multiple index nodes. We address two challenges for efficient query processing. First, to minimize the search complexity, we present a double-heap algorithm that supports not only progressive state search but also progressive state generation. Second, to avoid unnecessary disk accesses, we characterize a type of "empty-state" that does not contribute to the final results, and propose a new materialization model, join-signature, to prune empty-states. Our performance study shows that the proposed method achieves an order-of-magnitude speed-up over baseline solutions.
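
To make the index-merge idea concrete, the following is a minimal sketch of a best-first search over joint states, not the paper's double-heap algorithm. It assumes a toy in-memory index (`build_index`), a relation with two numeric attributes, a non-monotone ranking function `f` to be minimized, and a caller-supplied lower bound `bound` over a pair of value ranges; all of these names are hypothetical. Unlike the method described above, it generates every child state eagerly, has no join-signature, and detects empty states only at the leaf level by intersecting tuple ids.

```python
import heapq
from itertools import count

def build_index(values, fanout=2, leaf_size=2):
    """Toy hierarchical index over one attribute (assumed, non-empty input).
    values maps tid -> attribute value; each node stores its value range."""
    items = sorted(values.items(), key=lambda kv: kv[1])
    chunks = [items[i:i + leaf_size] for i in range(0, len(items), leaf_size)]
    nodes = [{"lo": c[0][1], "hi": c[-1][1], "tids": {t for t, _ in c},
              "children": []} for c in chunks]
    while len(nodes) > 1:
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [{"lo": g[0]["lo"], "hi": g[-1]["hi"], "tids": None,
                  "children": g} for g in groups]
    return nodes[0]

def top_k(rel, k, f, bound):
    """rel maps tid -> (x, y). Returns the k lowest (score, tid) pairs."""
    ix = build_index({t: v[0] for t, v in rel.items()})
    iy = build_index({t: v[1] for t, v in rel.items()})
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(bound(ix["lo"], ix["hi"], iy["lo"], iy["hi"]), next(tie), ix, iy)]
    results = []
    while heap and len(results) < k:
        score, _, a, b = heapq.heappop(heap)
        if a is None:
            # A concrete tuple with its exact score; b holds the tid.
            results.append((score, b))
        elif not a["children"] and not b["children"]:
            # Leaf-leaf joint state: only tids present in both leaves are
            # real tuples. An empty intersection is an empty state.
            for tid in a["tids"] & b["tids"]:
                heapq.heappush(heap, (f(*rel[tid]), next(tie), None, tid))
        else:
            # Expand into child joint states, keyed by their lower bounds.
            for ca in a["children"] or [a]:
                for cb in b["children"] or [b]:
                    lb = bound(ca["lo"], ca["hi"], cb["lo"], cb["hi"])
                    heapq.heappush(heap, (lb, next(tie), ca, cb))
    return results

def f(x, y):
    """Example ad-hoc ranking function: L1 distance to the point (3, 5)."""
    return abs(x - 3) + abs(y - 5)

def gap(target, lo, hi):
    """Distance from target to the interval [lo, hi] (0 if inside)."""
    return max(lo - target, 0, target - hi)

def bound(xlo, xhi, ylo, yhi):
    """Lower bound of f over the box [xlo, xhi] x [ylo, yhi]."""
    return gap(3, xlo, xhi) + gap(5, ylo, yhi)

rel = {1: (1, 9), 2: (4, 4), 3: (3, 8), 4: (7, 5), 5: (2, 5), 6: (9, 1)}
print(top_k(rel, 3, f, bound))  # prints [(1, 5), (2, 2), (3, 3)]
```

The search is correct because every joint state's key is a lower bound on the score of any tuple it contains, so a tuple popped from the heap cannot be beaten by anything still unexplored. Note that `f` is not monotone in either attribute, which is the case the sort-merge framework does not handle.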