Diversifying Top-K Results

Top-k query processing finds a list of k results with the largest scores with respect to a user-given query, under the assumption that the k results are independent of each other. In practice, some of the top-k results returned can be very similar to each other and are therefore redundant. In the literature, diversified top-k search has been studied to return k results that take both score and diversity into consideration. Most existing solutions for diversified top-k search assume that the scores of all search results are given, and some works solve the diversity problem in a specific setting and can hardly be extended to general cases. In this paper, we study the diversified top-k search problem. We define a general diversified top-k search problem that considers only the similarity of the search results themselves. We propose a framework in which most existing solutions for top-k query processing can be extended easily to handle diversified top-k search by applying three new functions: a sufficient stop condition sufficient(), a necessary stop condition necessary(), and an algorithm for diversified top-k search on the current set of generated results, div-search-current(). We propose three new algorithms, namely div-astar, div-dp, and div-cut, to solve the div-search-current() problem. div-astar is an A*-based algorithm. div-dp decomposes the results into components, searches each component independently using div-astar, and combines the results using dynamic programming. div-cut further decomposes the current set of generated results using cut points and combines the results using sophisticated operations. We conducted extensive performance studies using two real datasets, enwiki and reuters. Our div-cut algorithm finds the optimal solution for the diversified top-k search problem in seconds, even for k as large as 2,000.
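To make the problem concrete, the following is a minimal sketch under an assumed formulation that the abstract itself does not spell out: each result has a non-negative score, some pairs of results are marked as similar, and the goal is to pick at most k results, no two of them similar, with the maximum total score. The function name diversified_top_k and its parameters are hypothetical. The search is a plain exhaustive branch-and-bound suitable only for small inputs; it is not div-astar, div-dp, or div-cut.

```python
def diversified_top_k(scores, similar_pairs, k):
    """Return (total, picked): at most k pairwise dissimilar results
    with maximum total score. Assumes all scores are non-negative."""
    # Visit results in descending score order so the bound below is valid.
    items = sorted(scores, key=scores.get, reverse=True)
    adj = {v: set() for v in items}
    for a, b in similar_pairs:
        adj[a].add(b)
        adj[b].add(a)

    best = (0.0, [])

    def search(idx, chosen, total):
        nonlocal best
        if total > best[0]:
            best = (total, list(chosen))
        if idx == len(items) or len(chosen) == k:
            return
        # Optimistic bound: fill every free slot with the highest
        # remaining scores, ignoring similarity constraints.
        free = k - len(chosen)
        bound = total + sum(scores[v] for v in items[idx:idx + free])
        if bound <= best[0]:
            return
        v = items[idx]
        # Branch 1: take v if it is not similar to anything chosen so far.
        if all(u not in adj[v] for u in chosen):
            chosen.append(v)
            search(idx + 1, chosen, total + scores[v])
            chosen.pop()
        # Branch 2: skip v.
        search(idx + 1, chosen, total)

    search(0, [], 0.0)
    return best


# Hypothetical data: "a" is similar to both "b" and "c".
example_scores = {"a": 10, "b": 9, "c": 8, "d": 1}
example_similar = [("a", "b"), ("a", "c")]
total, picked = diversified_top_k(example_scores, example_similar, 2)
print(total, picked)
```

In this toy input the two highest-scoring results, "a" and "b", are similar to each other, so a plain top-2 would return a redundant pair, while the diversified answer drops "a" in favour of the dissimilar pair "b" and "c".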
