Efficient Sensitivity Analysis for Inequality Queries in Probabilistic Databases

In this paper, we study the processing of inequality queries (IQ queries) in tuple-independent probabilistic databases, where IQ queries are categorized into IQ-path, IQ-tree, and IQ-graph queries. We focus on two related issues. The first is to compute the probabilities of IQ queries efficiently, motivated by the observation that the state-of-the-art algorithm for IQ-graph queries has high time complexity. The second is to perform their sensitivity analysis efficiently, which has not been studied before. Here, sensitivity analysis identifies the input tuples that have a high influence on the probability of an answer tuple. The influence of an input tuple is defined as the difference between the output probabilities obtained in two cases: one in which the tuple is assumed to exist and one in which it is assumed not to exist. We compile the inequality conditions of an IQ query q into a compilation tree T, which encodes the Shannon expansion order. Moreover, we split q into a set of subqueries, each containing only one inequality condition. Using the compilation tree and this decomposition, we introduce a dynamic programming algorithm called Dec that processes an IQ query q in time O(|Φ|), where Φ is the lineage of q. An IQ query can be processed by Dec if and only if its inequality conditions can be compiled into a compilation tree T in which the inequality conditions from any node to all of its child nodes are the same. We conduct extensive experiments on real and synthetic datasets to demonstrate the efficiency of our algorithm for computing the probabilities and influences of IQ queries.
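
The definition of influence above can be illustrated with a small sketch. This is not the Dec algorithm: it is a naive Shannon expansion over a lineage written as a positive DNF, exponential in the worst case, and meant only to make the two-case definition concrete. The names `condition`, `probability`, and `influence`, the clause representation, and the example lineage and tuple probabilities are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List

# A lineage in positive DNF: a list of clauses, each a set of tuple variables.
Clause = FrozenSet[str]
DNF = List[Clause]


def condition(phi: DNF, var: str, value: bool) -> DNF:
    """Restrict phi by assuming tuple `var` exists (True) or not (False)."""
    restricted = []
    for clause in phi:
        if var not in clause:
            restricted.append(clause)
        elif value:
            # The variable is satisfied, so it drops out of the clause.
            restricted.append(clause - {var})
        # If value is False, the clause is falsified and is dropped.
    return restricted


def probability(phi: DNF, prob: Dict[str, float]) -> float:
    """Probability of phi under tuple independence, by Shannon expansion."""
    if not phi:
        return 0.0  # no clause left: phi is false
    if any(len(clause) == 0 for clause in phi):
        return 1.0  # an empty clause is trivially true
    var = min(min(clause) for clause in phi)  # deterministic variable choice
    p = prob[var]
    return (p * probability(condition(phi, var, True), prob)
            + (1.0 - p) * probability(condition(phi, var, False), prob))


def influence(phi: DNF, prob: Dict[str, float], var: str) -> float:
    """P(phi | var exists) - P(phi | var does not exist)."""
    return (probability(condition(phi, var, True), prob)
            - probability(condition(phi, var, False), prob))


if __name__ == "__main__":
    # Hypothetical lineage: (r1 AND s1) OR (r1 AND s2) OR (r2 AND s2)
    phi = [frozenset({"r1", "s1"}),
           frozenset({"r1", "s2"}),
           frozenset({"r2", "s2"})]
    prob = {"r1": 0.5, "r2": 0.4, "s1": 0.6, "s2": 0.5}
    print("P(phi) =", probability(phi, prob))
    for t in sorted(prob):
        print("influence of", t, "=", influence(phi, prob, t))
```

In this hypothetical lineage, r1 appears in two of the three clauses, so fixing it to present or absent shifts the answer probability more than fixing any other tuple does. Ranking input tuples by that shift is what sensitivity analysis is after.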
