Learning deterministic regular expressions for the inference of schemas from XML data

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, as we will show, no algorithm can learn the complete class of deterministic regular expressions from positive examples only. In the regular expressions that occur in practical DTDs and XSDs, however, every alphabet symbol occurs only a small number of times. In practice it therefore suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k and selects the one that best describes the sample, based on a Minimum Description Length argument. The effectiveness of the method is empirically validated on both real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
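To make the k-ORE definition concrete, the following is a minimal sketch that counts how often each alphabet symbol occurs in an expression and reports the smallest k for which the expression is a k-ORE. It only illustrates the definition and is not the learning algorithm itself. The function names (`occurrence_counts`, `min_k`, `is_k_ore`) and the textual syntax (element names written as identifiers, everything else treated as an operator) are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: alphabet symbols (element names) are written as identifiers;
# all other characters such as | * + ? ( ) and whitespace are operators.
SYMBOL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def occurrence_counts(expr: str) -> Counter:
    """Count how many times each alphabet symbol occurs in the expression."""
    return Counter(SYMBOL.findall(expr))


def min_k(expr: str) -> int:
    """Smallest k such that every symbol occurs at most k times."""
    return max(occurrence_counts(expr).values(), default=0)


def is_k_ore(expr: str, k: int) -> bool:
    """True if the expression is a k-occurrence regular expression."""
    return min_k(expr) <= k


if __name__ == "__main__":
    for expr in ["a (b | c)* d", "a* b a*"]:
        print(expr, "->", min_k(expr))
```

Under these assumptions, `a (b | c)* d` is a 1-ORE because every symbol occurs once, while `a* b a*` is a 2-ORE but not a 1-ORE because `a` occurs twice.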
