Aggregate queries for discrete and continuous probabilistic XML

Sources of data uncertainty and imprecision are numerous. One way to handle this uncertainty is to associate probabilistic annotations with the data. Many such probabilistic database models have been proposed, in both the relational and the semi-structured setting. The latter is particularly well suited to managing uncertain data produced by a variety of automatic processes. An important problem in the context of probabilistic XML databases is that of answering aggregate queries (count, sum, avg, etc.), which has received limited attention so far. Working in a model that unifies the various discrete semi-structured probabilistic models studied to date, we present algorithms to compute the distribution of the aggregation values, exploiting some regularity properties of the aggregate functions, as well as the probabilistic moments of this distribution, in particular expectation and variance. We also prove the intractability of some of these problems and investigate approximation techniques. Finally, we extend the discrete model to a continuous one in order to take into account continuous data values, such as measurements from sensor networks, and present algorithms to compute distribution functions and moments for various classes of continuous distributions of data values.
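To make the distinction between the distribution of an aggregate and its moments concrete, here is a minimal sketch in Python. It is an illustration only, not the algorithms of the paper: it assumes a deliberately simplified document in which each numeric leaf is present independently with a given probability, and the names `sum_distribution` and `sum_moments` are hypothetical. Under that assumption, the distribution of `sum` can be built by folding in one leaf at a time, while expectation and variance follow directly from linearity and independence without enumerating possible worlds.

```python
from collections import defaultdict
from fractions import Fraction


def sum_distribution(leaves):
    """Distribution of the sum over independent optional leaves.

    `leaves` is a list of (value, prob) pairs; each leaf is assumed to be
    present independently with probability `prob` (a simplifying assumption,
    not the general model of the paper).
    """
    dist = {0: Fraction(1)}  # with no leaves, the sum is 0 with probability 1
    for value, prob in leaves:
        nxt = defaultdict(Fraction)
        for total, p in dist.items():
            nxt[total] += p * (1 - prob)      # leaf absent
            nxt[total + value] += p * prob    # leaf present
        dist = dict(nxt)
    return dist


def sum_moments(leaves):
    """Expectation and variance of the sum, computed without the distribution."""
    mean = sum(prob * value for value, prob in leaves)
    # Each leaf contributes a scaled Bernoulli variable: Var = p(1-p) * value^2.
    var = sum(prob * (1 - prob) * value * value for value, prob in leaves)
    return mean, var


if __name__ == "__main__":
    leaves = [(10, Fraction(1, 2)), (20, Fraction(1, 4))]
    print(sum_distribution(leaves))
    print(sum_moments(leaves))
```

The distribution may have a number of distinct values that grows exponentially with the number of leaves, whereas the moments are computed in a single linear pass; this sketch is meant only to show why the two problems can differ in cost.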
