Being picky: processing top-k queries with set-defined selections

Focusing on the top-K items according to a ranking criterion is an important functionality in many query answering scenarios. The idea is to read only the necessary information (mostly from secondary storage), with the ultimate goal of achieving low latency. In this work, we consider processing such top-K queries under the constraint that the result items are members of a specific set, which is provided at query time. We call this restriction a set-defined selection criterion. Set-defined selections drastically influence the pros and cons of an id-ordered index vs. a score-ordered index. We present a mathematical model that makes it possible to decide at runtime which index to choose, leading to a combined index. To improve the latency around the break-even point of the two indices, we show how to benefit from a partitioned score-ordered index and present an algorithm that creates such partitions by analyzing query logs. Further performance gains can be obtained by using approximate top-K results with tunable result quality. The presented approaches are evaluated using both real-world and synthetic data.
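
To make the trade-off between the two index types concrete, the following is a minimal in-memory sketch, not the implementation or the cost model of this work. It assumes that a score-ordered index is a list of (id, score) pairs sorted by descending score, that an id-ordered index is a dictionary from id to score, and that selected items are spread uniformly over the score order. All function names (`topk_score_ordered`, `topk_id_ordered`, `choose_index`, `topk`) and the cost estimate are hypothetical and serve only as illustration.

```python
import heapq


def topk_score_ordered(score_index, selection, k):
    """Scan (id, score) pairs in descending score order and stop after k hits.

    Returns the hits and the number of index entries read.
    """
    result = []
    scanned = 0
    for item_id, score in score_index:
        scanned += 1
        if item_id in selection:
            result.append((item_id, score))
            if len(result) == k:
                break
    return result, scanned


def topk_id_ordered(id_index, selection, k):
    """Look up every selected id, then keep the k highest-scoring items.

    Returns the hits and the number of lookups performed.
    """
    hits = [(i, id_index[i]) for i in selection if i in id_index]
    best = heapq.nlargest(k, hits, key=lambda pair: pair[1])
    return best, len(selection)


def choose_index(n_items, n_selected, k):
    """Toy runtime decision (an assumption, not the model from this work).

    With uniformly spread selected items, a score-ordered scan reads about
    k * n_items / n_selected entries, while the id-ordered index needs one
    lookup per selected id. Differences between sequential and random
    access costs are ignored here.
    """
    if n_selected == 0:
        return "id"
    expected_scan = k * n_items / n_selected
    return "score" if expected_scan < n_selected else "id"


def topk(id_index, score_index, selection, k):
    """Combined index: pick one of the two access paths per query."""
    which = choose_index(len(id_index), len(selection), k)
    if which == "score":
        result, cost = topk_score_ordered(score_index, selection, k)
    else:
        result, cost = topk_id_ordered(id_index, selection, k)
    return which, result, cost


# Synthetic data: item i has score 100 - i.
id_index = {i: 100 - i for i in range(100)}
score_index = sorted(id_index.items(), key=lambda pair: pair[1], reverse=True)

small_selection = {5, 50, 95}
large_selection = set(range(10, 90))
```

In this sketch, a small selection such as `small_selection` favors the id-ordered path, because a score-ordered scan would read many non-selected entries before finding k hits. A large selection such as `large_selection` favors the score-ordered path, which terminates after reading only a few entries.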
