Inference of concise DTDs from XML data

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that essentially reduces to learning concise regular expressions from positive example strings. We identify two classes of such expressions: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the vast majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD), which learns SOREs from strings by first inferring an automaton using known techniques and then translating that automaton into a corresponding SORE, repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel technique for rewriting automata into regular expressions, which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness, and speed. When only a very small amount of XML data is available, for instance when the data is generated by Web service requests or by answers to queries, iDTD produces regular expressions that are too specific. We therefore introduce a novel learning algorithm, CRX, that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas.
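
To make the first step of iDTD concrete, the following is a minimal Python sketch of inferring an automaton from positive example strings. The abstract only says "known techniques", so the choice here is an assumption: a 2-testable style inference in which every element name becomes one state and an edge is added for every pair of adjacent names seen in the samples. The names `infer_soa`, `accepts`, `START`, and `END` are hypothetical and introduced only for illustration; the translation of the automaton into a SORE and the repair step are not shown.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical sentinel states marking the beginning and end of a string.
START, END = "<start>", "<end>"

Edge = Tuple[str, str]


def infer_soa(samples: Iterable[Sequence[str]]) -> Set[Edge]:
    """Collect one edge per adjacent pair of element names in the samples.

    Each element name corresponds to exactly one state, so the resulting
    automaton is fully described by its edge set.
    """
    edges: Set[Edge] = set()
    for word in samples:
        prev = START
        for name in word:
            edges.add((prev, name))
            prev = name
        edges.add((prev, END))
    return edges


def accepts(edges: Set[Edge], word: Sequence[str]) -> bool:
    """Return True if every adjacent pair in `word` is an edge of the automaton."""
    prev = START
    for name in word:
        if (prev, name) not in edges:
            return False
        prev = name
    return (prev, END) in edges


# Child-element sequences of two hypothetical <book> elements.
samples = [
    ["title", "author", "author", "year"],
    ["title", "author"],
]
soa = infer_soa(samples)
```

Under these assumptions the inferred automaton generalizes beyond the samples: it accepts a `title` followed by any positive number of `author` elements and an optional `year`, which matches the single occurrence expression `title author+ year?`, in which each element name appears once.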
