Modeling and querying probabilistic XML data

We survey recent results on modeling and querying probabilistic XML data. The literature contains a plethora of probabilistic XML models [2, 13, 14, 18, 21, 24, 27], and most of them can be represented by means of p-documents [18] that have, in addition to ordinary nodes, distributional nodes that specify the probabilistic process of generating a random document. The above models are families of p-documents that differ in the types of distributional nodes in use. The focus of this survey is on the tradeoff between the ability to express real-world probabilistic data (in particular, by taking correlations between atomic events into account) and the efficiency of query evaluation. We concentrate on two important issues. The first is the ability to efficiently translate a p-document of one family into that of another. The second is the complexity of query evaluation over p-documents (under the usual semantics of querying probabilistic data, e.g., [4, 9, 10]). It turns out that efficient evaluation of a large class of queries (i.e., twig patterns with projection and aggregate functions) is realizable in models where distributional nodes are probabilistically independent. In other models, the evaluation of a query with projection is very often intractable. In comparison, very simple conjunctive queries are intractable over probabilistic models of relational databases, even when the tuples are probabilistically independent [9, 10]. To handle the limitation exhibited by the above tradeoff, various approaches have been proposed. The first is to allow query answers to be approximate [18], which makes the evaluation of twig patterns with projection tractable in the most expressive family of p-documents among those considered. This tractability, however, does not carry over to nonmonotonic queries, such as twig patterns with negation or aggregation. The approach presented in [7]
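As a rough illustration of how distributional nodes drive the generation of a random document, the following Python sketch models a tiny p-document with two kinds of distributional nodes that are commonly described in the probabilistic XML literature: one that keeps each child independently (called `Ind` here) and one that keeps at most one child (called `Mux` here). The class names, the helper functions `sample` and `prob_label`, and the example document are all assumptions made for this illustration; they are not taken from the survey. The sketch also assumes that distinct distributional nodes make their choices independently, which is the setting in which the abstract reports efficient query evaluation.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Ordinary:
    """An ordinary XML node with a label and (possibly distributional) children."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Ind:
    """Hypothetical distributional node: each child is kept independently with its probability."""
    choices: List[Tuple["Node", float]]


@dataclass
class Mux:
    """Hypothetical distributional node: at most one child is kept (probabilities sum to <= 1)."""
    choices: List[Tuple["Node", float]]


Node = Union[Ordinary, Ind, Mux]


def sample(node: Node, rng: random.Random) -> list:
    """Generate a random document; returns a list of (label, children) trees.

    Distributional nodes do not appear in the output: the subtrees they
    select are attached to the nearest ordinary ancestor.
    """
    if isinstance(node, Ordinary):
        kids = [t for c in node.children for t in sample(c, rng)]
        return [(node.label, kids)]
    if isinstance(node, Ind):
        out = []
        for child, p in node.choices:
            if rng.random() < p:
                out.extend(sample(child, rng))
        return out
    # Mux: pick at most one child according to the cumulative distribution.
    r, acc = rng.random(), 0.0
    for child, p in node.choices:
        acc += p
        if r < acc:
            return sample(child, rng)
    return []


def prob_label(node: Node, label: str) -> float:
    """Exact probability that some node with `label` is generated below `node`.

    The bottom-up computation relies on the independence assumption stated
    above: choices made at different distributional nodes are independent.
    """
    if isinstance(node, Ordinary):
        if node.label == label:
            return 1.0
        miss = 1.0
        for c in node.children:
            miss *= 1.0 - prob_label(c, label)
        return 1.0 - miss
    if isinstance(node, Ind):
        miss = 1.0
        for child, p in node.choices:
            miss *= 1.0 - p * prob_label(child, label)
        return 1.0 - miss
    # Mux: the alternatives are mutually exclusive, so probabilities add up.
    return sum(p * prob_label(child, label) for child, p in node.choices)


# Made-up example: a person who may have a phone and/or one further contact.
doc = Ordinary("person", [
    Ind([
        (Ordinary("phone"), 0.5),
        (Mux([(Ordinary("email"), 0.3), (Ordinary("phone"), 0.4)]), 1.0),
    ])
])

if __name__ == "__main__":
    print(sample(doc, random.Random(0)))
    print(prob_label(doc, "phone"))
```

In this sketch the query "does a `phone` node exist?" is answered in one pass over the p-document rather than by enumerating all possible documents. That shortcut is only valid under the independence assumption; with correlated distributional nodes the products and sums above would no longer be correct.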

[1] V. S. Subrahmanian, et al. Probabilistic interval XML, 2003, TOCL.

[2] Christopher Ré, et al. Efficient Evaluation of, 2007, DBPL.

[3] H. V. Jagadish, et al. ProTDB: Probabilistic Data in XML, 2002, VLDB.

[4] Divesh Srivastava, et al. Holistic twig joins: optimal XML pattern matching, 2002, SIGMOD '02.

[5] Leslie G. Valiant, et al. The Complexity of Computing the Permanent, 1979, Theor. Comput. Sci.

[6] César A. Galindo-Legaria, et al. Outerjoins as disjunctions, 1994, SIGMOD '94.

[7] Dan Suciu, et al. The dichotomy of conjunctive queries on probabilistic structures, 2006, PODS.

[8] Yehoshua Sagiv, et al. Incorporating constraints in probabilistic XML, 2009, TODS.

[9] Serge Abiteboul, et al. Querying and Updating Probabilistic Information in XML, 2006, EDBT.

[10] Stathis Zachos, et al. Probabilistic Quantifiers and Games, 1988, J. Comput. Syst. Sci.

[11] Michael Luby, et al. Approximating Probabilistic Inference in Bayesian Belief Networks is NP-Hard, 1993, Artif. Intell.

[12] Laks V. S. Lakshmanan, et al. Minimization of tree pattern queries, 2001, SIGMOD '01.

[13] Jennifer Widom, et al. Databases with uncertainty and lineage, 2008, The VLDB Journal.

[14] Mihalis Yannakakis, et al. On Generating All Maximal Independent Sets, 1988, Inf. Process. Lett.

[15] Yehoshua Sagiv, et al. Matching Twigs in Probabilistic XML, 2007, VLDB.

[16] Mitsunori Ogihara, et al. Counting Classes are at Least as Hard as the Polynomial-Time Hierarchy, 1992, SIAM J. Comput.

[17] Yehoshua Sagiv, et al. Maximally joining probabilistic data, 2007, PODS.

[18] Serge Abiteboul, et al. On the expressiveness of probabilistic XML models, 2009, The VLDB Journal.

[19] Mihalis Yannakakis, et al. On the Complexity of Database Queries, 1999, J. Comput. Syst. Sci.

[20] Richard M. Karp, et al. Monte-Carlo Approximation Algorithms for Enumeration Problems, 1989, J. Algorithms.

[21] Yehoshua Sagiv, et al. Query efficiency in probabilistic XML models, 2008, SIGMOD Conference.

[22] Leslie G. Valiant, et al. Random Generation of Combinatorial Structures from a Uniform Distribution, 1986, Theor. Comput. Sci.

[23] Michael R. Fellows, et al. Parameterized Complexity, 1998.

[24] Dan Suciu, et al. Efficient query evaluation on probabilistic databases, 2004, The VLDB Journal.

[25] Maurice van Keulen, et al. A probabilistic XML approach to data integration, 2005, 21st International Conference on Data Engineering (ICDE'05).

[26] Yehoshua Sagiv, et al. Full disjunctions: polynomial-delay iterators in action, 2006, VLDB.

[27] V. S. Subrahmanian, et al. PXML: a probabilistic semistructured data model and algebra, 2003, Proceedings 19th International Conference on Data Engineering.

[28] Serge Abiteboul, et al. On the complexity of managing probabilistic XML data, 2007, PODS '07.