Attribute and object selection queries on objects with probabilistic attributes

Modern data processing techniques such as entity resolution, data cleaning, information extraction, and automated tagging often produce objects whose attributes are uncertain. This uncertainty is frequently captured as a set of mutually exclusive value choices for each uncertain attribute, with a probability attached to each alternative. However, lay end-users, as well as some end-applications, may be unable to interpret results that are output in this form. The question, then, is how to present such results in practice, for example to support the attribute-value selection and object selection queries a user might issue. Specifically, in this article we study the problem of maximizing the quality of these selection queries on top of such a probabilistic representation, where quality is measured using standard set-based quality metrics. We formalize the problem and then develop efficient approaches that provide high-quality answers for these queries. A comprehensive empirical evaluation over three different domains demonstrates the advantage of our approach over existing techniques.
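To make the representation concrete, the following is a minimal sketch of an object selection query over probabilistic attributes. It is not the approach developed in the article: the data, the function names (`select_objects`, `expected_precision`), the threshold rule, and the convention that an empty answer has precision 0 are all assumptions made for illustration. It relies only on the fact that, for a fixed non-empty answer set, expected precision is the average of the members' match probabilities.

```python
from typing import Dict, List

# Hypothetical data: each object has one uncertain attribute, modelled as
# mutually exclusive candidate values mapped to their probabilities.
objects: Dict[str, Dict[str, float]] = {
    "img1": {"beach": 0.7, "desert": 0.3},
    "img2": {"beach": 0.4, "forest": 0.6},
    "img3": {"forest": 0.9, "beach": 0.1},
}


def select_objects(objs: Dict[str, Dict[str, float]],
                   value: str, threshold: float) -> List[str]:
    """Return ids of objects whose attribute equals `value` with
    probability at least `threshold` (an assumed, simple selection rule)."""
    return sorted(oid for oid, dist in objs.items()
                  if dist.get(value, 0.0) >= threshold)


def expected_precision(objs: Dict[str, Dict[str, float]],
                       answer: List[str], value: str) -> float:
    """Expected fraction of `answer` that truly matches `value`.

    By linearity of expectation this is the mean match probability.
    An empty answer is given precision 0.0 here (an assumed convention).
    """
    if not answer:
        return 0.0
    return sum(objs[oid].get(value, 0.0) for oid in answer) / len(answer)


answer = select_objects(objects, "beach", threshold=0.4)
precision = expected_precision(objects, answer, "beach")
```

With this toy data, a threshold of 0.4 selects `img1` and `img2`, and the expected precision is roughly 0.55. Raising the threshold trades a smaller answer set for higher expected precision, which is the kind of trade-off a set-based quality metric has to balance.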
