Recovering data semantics from XML documents into DTD graph with SAX

We propose a systematic approach to reverse engineer arbitrary XML documents into their conceptual schema, represented as DTD Graphs. This is necessary because XML documents are frequently used to store structured data while their schemas, such as those in Document Type Definition (DTD) format, are missing, especially for historical XML documents. Without a schema, it is difficult for software developers or end users to make use of such documents. Even when schemas exist, they are difficult to read and do not make explicit the underlying relationships among the elements in the documents. It is therefore necessary to determine the data semantics from the XML documents themselves. If the DTDs of the XML documents exist and identify the ID/IDREF(S) type attributes, then more data semantics can be derived. Another application of the determined data semantics is to verify the linkages implemented by ID/IDREF(S): if an element refers to an incorrect XML element type, an extra data semantic is determined as a result, and such findings can be used for verification purposes. Furthermore, the approaches proposed in this paper use the Simple API for XML (SAX), so the algorithms are applicable to XML documents of any size, from small to very large.
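To illustrate why a SAX-based pass scales to large documents, the following is a minimal sketch in Python using the standard `xml.sax` module. It is not the algorithm proposed in the paper. It only shows how parent/child element relationships and their maximum occurrence counts can be collected in a single streaming pass, keeping one small record per open element. The names `EdgeCollector`, `collect_edges`, and `max_occurs` are hypothetical and chosen for this example.

```python
import xml.sax


class EdgeCollector(xml.sax.ContentHandler):
    """Collects (parent, child) element pairs in one streaming pass.

    For each pair it records the largest number of times the child
    appeared under a single parent instance, a rough hint at cardinality.
    """

    def __init__(self):
        super().__init__()
        self.max_occurs = {}  # (parent, child) -> max count under one parent
        self._frames = []     # one child-count dict per currently open element

    def startElement(self, name, attrs):
        if self._frames:
            # Count this element as a child of the enclosing open element.
            counts = self._frames[-1]
            counts[name] = counts.get(name, 0) + 1
        self._frames.append({})

    def endElement(self, name):
        # The element is complete: fold its child counts into the summary.
        counts = self._frames.pop()
        for child, n in counts.items():
            key = (name, child)
            self.max_occurs[key] = max(self.max_occurs.get(key, 0), n)


def collect_edges(xml_text):
    """Parse an XML string with SAX and return the (parent, child) summary."""
    handler = EdgeCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.max_occurs


if __name__ == "__main__":
    sample = "<orders><order><item/><item/></order><order><item/></order></orders>"
    for (parent, child), n in sorted(collect_edges(sample).items()):
        print(parent, "->", child, "max occurrences:", n)
```

Because the handler keeps only a stack of open elements and a summary dictionary, memory use grows with document depth and the number of distinct element pairs, not with document size. A count greater than one suggests a one-to-many relationship between the parent and child element types; deriving the full data semantics described above would need further rules that this sketch does not attempt.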
