Optimal Probabilistic Generation of XML Documents

We study the following problem: given a corpus of XML documents and its schema, find an optimal generative probabilistic model, where optimality means maximizing the likelihood that the model generates that particular corpus. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model in the absence of constraints. We then study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints. In this setting we study two kinds of generators. First, we consider a continuation-test generator that performs schema satisfiability tests while generating documents; these tests prevent the generation of a document that violates the constraints but, as we will see, they are computationally expensive. We also study a restart generator that may generate an invalid document and, when this happens, restarts and tries again. Finally, we consider the injection of data values into the structure to obtain a full XML document, and we study different approaches for generating these values.
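As a rough illustration of the restart idea only, the sketch below generates a candidate document without looking at the constraints, checks a key constraint afterwards, and starts over when the check fails. The toy catalog/item schema, the id domain 1..4, and the names `generate_candidate`, `satisfies_key`, and `restart_generator` are all assumptions made for this example; they are not taken from the paper, and the candidate distribution is arbitrary rather than the optimal one.

```python
import random


def generate_candidate(rng):
    """Generate a toy document, ignoring all integrity constraints.

    Hypothetical schema: a catalog holds 1 to 3 items, and each item
    carries an id drawn uniformly from the domain {1, 2, 3, 4}.
    """
    n_items = rng.randint(1, 3)
    return {"catalog": [{"item": {"id": rng.randint(1, 4)}} for _ in range(n_items)]}


def satisfies_key(doc):
    """Key constraint (assumed for this sketch): item ids must be unique."""
    ids = [entry["item"]["id"] for entry in doc["catalog"]]
    return len(ids) == len(set(ids))


def restart_generator(rng, max_restarts=1000):
    """Generate, validate, and restart from scratch on a violation."""
    for attempt in range(1, max_restarts + 1):
        doc = generate_candidate(rng)
        if satisfies_key(doc):
            return doc, attempt
    raise RuntimeError("no valid document within the restart budget")


rng = random.Random(0)  # fixed seed so the run is reproducible
doc, attempts = restart_generator(rng)
print(doc, "generated after", attempts, "attempt(s)")
```

In this sketch the validity check runs only once per finished candidate, which is cheap, but the number of restarts grows as the constraints become harder to satisfy by chance. A continuation-test generator would instead test satisfiability during generation, trading restarts for more expensive tests.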
