Capturing continuous data and answering aggregate queries in probabilistic XML

Data uncertainty and imprecision have many sources. One way to handle this uncertainty is to associate probabilistic annotations with data. Many such probabilistic database models have been proposed, in both the relational and the semi-structured setting. The latter is particularly well suited to managing uncertain data produced by a variety of automatic processes. An important problem in the context of probabilistic XML databases is that of answering aggregate queries (count, sum, avg, etc.), which has received limited attention so far. Working in a model that unifies the various (discrete) semi-structured probabilistic models studied to date, we present algorithms to compute the distribution of the aggregation values (exploiting some regularity properties of the aggregate functions) and the probabilistic moments (especially expectation and variance) of this distribution. We also prove the intractability of some of these problems and investigate approximation techniques. Finally, we extend the discrete model to a continuous one, in order to take into account continuous data values such as measurements from sensor networks, and extend our algorithms and complexity results to the continuous case.
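
To make the two kinds of output concrete (the full distribution of an aggregate versus only its moments), here is a minimal sketch for the sum aggregate. It is not the paper's algorithm: it assumes a deliberately simplified setting in which each numeric leaf is present independently with a given probability, and the names `sum_distribution`, `sum_moments`, and `leaves` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

def sum_distribution(leaves):
    """Distribution of the sum over independently present leaves.

    leaves: list of (value, probability) pairs. Assumption: each leaf
    appears independently of the others, a simplification of the
    models the abstract refers to.
    """
    dist = {0: 1.0}  # with no leaves considered, the sum is 0 with probability 1
    for value, p in leaves:
        nxt = defaultdict(float)
        for partial, q in dist.items():
            nxt[partial] += q * (1 - p)       # leaf absent
            nxt[partial + value] += q * p     # leaf present
        dist = dict(nxt)
    return dist

def sum_moments(leaves):
    """Expectation and variance of the sum, without building the distribution.

    Uses linearity of expectation and, under the independence
    assumption, additivity of variance.
    """
    mean = sum(p * v for v, p in leaves)
    var = sum(p * (1 - p) * v * v for v, p in leaves)
    return mean, var

leaves = [(10, 0.5), (20, 0.25)]
dist = sum_distribution(leaves)
mean, var = sum_moments(leaves)
```

In this toy setting the distribution can have exponentially many distinct values in the number of leaves, whereas the moments take a single linear pass, which gives some intuition for why distributions and moments are treated as separate problems.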
