A unified approach to ranking in probabilistic databases

Ranking is a fundamental operation in data analysis and decision support, and it plays an even more crucial role when the dataset being explored exhibits uncertainty. In recent years, this has led to much work on how to rank the tuples in a probabilistic dataset. In this article, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criterion optimization problem and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRFω and PRFe, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating-function-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRFe, at approximating other ranking functions, as well as the scalability of our proposed algorithms for exact or approximate ranking.
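To make the idea of a parameterized ranking function concrete, the following minimal sketch may help. It rests on assumptions that the abstract does not spell out: tuples are mutually independent (no and/xor trees or Markov networks), a tuple's ranking value is a weighted sum over its rank distribution, sum over i of ω(i) · Pr(rank = i), and PRFe corresponds to the weight ω(i) = α^i. The function names (`rank_distributions`, `prf`, `prf_e`) are hypothetical and chosen only for illustration. The rank distribution is read off the coefficients of the generating function formed by multiplying (1 - p + p·x) over all higher-scored tuples.

```python
def rank_distributions(tuples):
    """tuples: list of (score, existence_probability), assumed independent.

    Returns a list of (score, dist) in descending score order, where
    dist[j] is the probability that the tuple exists and is ranked j + 1.
    """
    ordered = sorted(tuples, key=lambda t: -t[0])
    # Coefficients of prod(1 - p + p*x) over the higher-scored tuples seen
    # so far: prefix[k] = Pr(exactly k of those tuples exist).
    prefix = [1.0]
    result = []
    for score, p in ordered:
        # The tuple is at rank k + 1 iff it exists and exactly k
        # higher-scored tuples exist.
        result.append((score, [p * c for c in prefix]))
        # Multiply the running polynomial by (1 - p + p*x).
        nxt = [0.0] * (len(prefix) + 1)
        for k, c in enumerate(prefix):
            nxt[k] += c * (1.0 - p)
            nxt[k + 1] += c * p
        prefix = nxt
    return result


def prf(tuples, weight):
    """Assumed form: value = sum over ranks i of weight(i) * Pr(rank = i)."""
    return [
        (score, sum(weight(j + 1) * q for j, q in enumerate(dist)))
        for score, dist in rank_distributions(tuples)
    ]


def prf_e(tuples, alpha):
    """Assumed special case with geometric weights weight(i) = alpha ** i."""
    return prf(tuples, lambda i: alpha ** i)


if __name__ == "__main__":
    data = [(10, 0.5), (8, 0.8)]          # hypothetical (score, probability) pairs
    print(prf(data, lambda i: 1.0))        # weight 1 everywhere: existence probability
    print(prf(data, lambda i: 1.0 if i == 1 else 0.0))  # probability of being ranked first
    print(prf_e(data, 0.5))
```

Under these assumptions, changing the weight function changes the ranking semantics: a constant weight orders tuples by existence probability, a step weight orders them by the probability of appearing in the top positions, and the geometric weight gives a one-parameter family in between. This direct expansion takes quadratic time in the number of tuples and is meant only to illustrate the idea, not to reproduce the article's algorithms.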
