Enabling data retrieval: by ranking and beyond

The ubiquitous use of databases for managing structured data, compounded with the expanded reach of the Internet to end users, has brought forward new scenarios of data retrieval. Users often want to express non-traditional fuzzy queries with soft criteria, in contrast to Boolean queries, and to explore what choices are available in databases and how they match the query criteria. Conventional database management systems (DBMSs) have become increasingly inadequate for such scenarios. Towards enabling such data retrieval, this thesis first studies how to fundamentally integrate ranking into databases. We built RankSQL, a DBMS that provides systematic and principled support for ranking queries. With a new ranking algebra and a query optimizer extended for that algebra, RankSQL captures ranking as a first-class construct in databases, together with traditional Boolean constructs. We also invented efficient techniques for answering ad-hoc ranking aggregate queries. RankSQL provides significant performance improvement over current DBMSs in processing ranking queries and ranking aggregate queries. This thesis further studies how to enable retrieval mechanisms beyond just ranking. Our exploratory study in this direction is exemplified by two novel proposals: one is to integrate clustering and ranking of database query results; the other is to support inverse ranking queries that provide the ranks of objects in a query context. Injecting such non-traditional facilities into databases presents non-trivial challenges in both defining query semantics and designing query processing methods. We extended the SQL language to express such queries and invented partition- and summary-driven approaches to process them.
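
The contrast between Boolean queries and ranking queries with soft criteria can be illustrated with a minimal Python sketch. The `hotels` relation, its attributes, and the `score` function are all hypothetical. The naive top-k evaluation shown here scores every row and is not RankSQL's ranking algebra or optimizer; it only shows the query semantics that such a system would need to support efficiently.

```python
import heapq

# Hypothetical toy relation; attribute names and values are illustrative only.
hotels = [
    {"name": "A", "price": 80, "rating": 4.1},
    {"name": "B", "price": 120, "rating": 4.8},
    {"name": "C", "price": 60, "rating": 3.2},
    {"name": "D", "price": 150, "rating": 4.9},
]


def boolean_query(rows, max_price, min_rating):
    """Hard criteria: a row either satisfies every predicate or is dropped."""
    return [
        r for r in rows
        if r["price"] <= max_price and r["rating"] >= min_rating
    ]


def score(row):
    """Assumed scoring function: higher rating and lower price score better."""
    return row["rating"] / 5.0 + (1.0 - row["price"] / 200.0)


def top_k(rows, k):
    """Soft criteria: return the k best-scoring rows, best first.

    Naive evaluation that scores every row; a rank-aware DBMS would try to
    avoid materializing and scoring the full input.
    """
    return heapq.nlargest(k, rows, key=score)


if __name__ == "__main__":
    # The Boolean query is too strict for this data and returns no rows.
    print(boolean_query(hotels, max_price=100, min_rating=4.5))
    # The ranking query still returns the closest matches.
    print([r["name"] for r in top_k(hotels, 2)])
```

With this toy data the strict Boolean query returns nothing, while the ranking query still surfaces the best available choices, which is the exploratory behavior the retrieval scenarios above call for.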
