DTD inference for views of XML data

We study the inference of Document Type Definitions (DTDs) for views of XML data, using an abstraction that focuses on document content structure. The views are defined by a query language that produces a list of documents selected from one or more input sources. The selection conditions involve vertical and horizontal navigation, and thus explicitly query the order present in the input documents. We point out several strong limitations in the descriptive ability of current DTDs and the need to extend them with (i) a subtyping mechanism and (ii) a specification mechanism more powerful than regular languages, such as context-free languages. With these extensions, we show that one can always infer tight DTDs that precisely characterize a selection view on sources satisfying given DTDs. We also show important special cases where one can infer a tight DTD without requiring extension (ii). Finally, we consider related problems such as verifying conformance of a view definition with a predefined DTD. Extensions to more powerful views that construct complex documents are also briefly discussed.
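As a rough illustration of the abstraction of a DTD as a description of content structure, the sketch below treats each element's content model as a regular expression over the ordered sequence of its child element names and checks whether a document conforms. It is not the inference procedure studied here; the `DTD` dictionary, the element names, and the `conforms` helper are all hypothetical and chosen only for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD: element name -> regular expression over child names.
# Every child name is followed by a single space so that names cannot
# run together when the child sequence is flattened into a string.
DTD = {
    "library": r"(book )*",
    "book": r"title (author )+(year )?",
    "title": r"",
    "author": r"",
    "year": r"",
}


def conforms(element, dtd):
    """Return True if each element's ordered child-name sequence
    matches the content model declared for it in dtd."""
    model = dtd.get(element.tag)
    if model is None:
        # Undeclared element: treated as non-conforming in this sketch.
        return False
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child, dtd) for child in element)


good = ET.fromstring(
    "<library><book><title/><author/><author/></book></library>"
)
# Same children in a different order: order matters for a content model.
bad = ET.fromstring("<library><book><author/><title/></book></library>")

print(conforms(good, DTD))
print(conforms(bad, DTD))
```

Because the content models here are ordinary regular expressions, this sketch can only express regular constraints on child sequences, which is the kind of descriptive limit that extension (ii) is meant to address.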
