RankSQL: query algebra and optimization for relational top-k queries

This paper introduces RankSQL, a system that provides a systematic and principled framework to support efficient evaluations of ranking (top-k) queries in relational database systems (RDBMS), by extending relational algebra and query optimization. Previously, top-k query processing is studied in the middleware scenario or in RDBMS in a "piecemeal" fashion, i.e., focusing on specific operator or sitting outside the core of query engines. In contrast, we aim to support ranking as a first-class database construct. As a key insight, the new ranking relationship can be viewed as another logical property of data, parallel to the "membership" property of relational data model. While membership is essentially supported in RDBMS, the same support for ranking is clearly lacking. We address the fundamental integration of ranking in RDBMS in a way similar to how membership, i.e., Boolean filtering, is supported. We extend relational algebra by proposing a rank-relational model to capture the ranking property, and introducing new and extended operators to support ranking as a first-class construct. Enabled by the extended algebra, we present a pipelined and incremental execution model of ranking query plans (that cannot be expressed traditionally) based on a fundamental ranking principle. To optimize top-k queries, we propose a dimensional enumeration algorithm to explore the extended plan space by enumerating plans along two dual dimensions: ranking and membership. We also propose a sampling-based method to estimate the cardinality of rank-aware operators, for costing plans. Our experiments show the validity of our framework and the accuracy of the proposed estimation model.

[1]  Divesh Srivastava,et al.  Ranked join indices , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[2]  Surajit Chaudhuri,et al.  Optimization of queries with user-defined predicates , 1996, TODS.

[3]  Luis Gravano,et al.  Evaluating top-k queries over Web-accessible databases , 2002, Proceedings 18th International Conference on Data Engineering.

[4]  Ronald Fagin,et al.  Combining Fuzzy Information from Multiple Systems , 1999, J. Comput. Syst. Sci..

[5]  Surya Nepal,et al.  Query processing issues in image (multimedia) databases , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[6]  Luis Gravano,et al.  Evaluating Top-k Selection Queries , 1999, VLDB.

[7]  Sriram Raghavan,et al.  Complex Queries over Web Repositories , 2003, VLDB.

[8]  Dimitrios Gunopulos,et al.  Efficient Approximation Of Optimization Queries Under Parametric Aggregation Constraints , 2003, VLDB.

[9]  Werner Kießling,et al.  Foundations of Preferences in Database Systems , 2002, VLDB.

[10]  Donald D. Chamberlin,et al.  Access Path Selection in a Relational Database Management System , 1989 .

[11]  Moni Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS.

[12]  Jennifer Widom,et al.  Adaptive ordering of pipelined stream filters , 2004, SIGMOD '04.

[13]  Sudipto Guha,et al.  Merging the Results of Approximate Match Operations , 2004, VLDB.

[14]  Ronald Fagin,et al.  Combining fuzzy information from multiple systems (extended abstract) , 1996, PODS.

[15]  Michael J. Carey,et al.  On saying “Enough already!” in SQL , 1997, SIGMOD '97.

[16]  Walid G. Aref,et al.  Rank-aware query optimization , 2004, SIGMOD '04.

[17]  Walid G. Aref,et al.  Joining Ranked Inputs in Practice , 2002, VLDB.

[18]  Seung-won Hwang,et al.  Minimal probing: supporting expensive predicates for top-k queries , 2002, SIGMOD '02.

[19]  Michael Stonebraker,et al.  Optimization of parallel query execution plans in XPRS , 1991, [1991] Proceedings of the First International Conference on Parallel and Distributed Information Systems.

[20]  Goetz Graefe The Cascades Framework for Query Optimization , 1995, IEEE Data Eng. Bull..

[21]  Rajeev Motwani,et al.  On random sampling over joins , 1999, SIGMOD '99.

[22]  Michael Stonebraker,et al.  Predicate migration: optimizing queries with expensive predicates , 1992, SIGMOD Conference.

[23]  John R. Smith,et al.  Supporting Incremental Join Queries on Ranked Inputs , 2001, VLDB.

[24]  Vagelis Hristidis,et al.  PREFER: a system for the efficient execution of multi-parametric ranked queries , 2001, SIGMOD '01.

[25]  Yuguo Chen,et al.  Efficient maintenance of materialized top-k views , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[26]  Goetz Graefe,et al.  The Volcano optimizer generator: extensibility and efficient search , 1993, Proceedings of IEEE 9th International Conference on Data Engineering.

[27]  John R. Smith,et al.  The onion technique: indexing for linear optimization queries , 2000, SIGMOD '00.

[28]  Ravi Krishnamurthy,et al.  Towards on Open Architecture for LDL , 1989, VLDB.

[29]  Peter J. Haas,et al.  Ripple joins for online aggregation , 1999, SIGMOD '99.

[30]  A. N. Wilschut,et al.  Dataflow query execution in a parallel main-memory environment , 1991, Distributed and Parallel Databases.