A Heuristic Approach for Converting HTML Documents to XML Documents

Although XML is being adopted rapidly, a large number of HTML documents remain on the Web. In this paper, we present a heuristic approach for converting HTML documents to XML documents. During conversion, we exclude all HTML elements from the resulting XML document, since these elements are designed exclusively for displaying data, but we retain the character data of each element along with the implicit hierarchy among the data. The proposed approach extracts the data hierarchy of an HTML document as closely as possible without human intervention. It can be used to construct the data hierarchy of an HTML document and to collect data from HTML documents into an XML repository.
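To make the idea concrete, the following is a minimal sketch of one possible heuristic, not the algorithm proposed in the paper. It assumes that heading levels (h1 to h6) signal the implicit hierarchy, drops every HTML element, and keeps only the character data. The class name `HierarchySketch`, the function `html_to_xml`, and the output element names `document`, `section`, and `text` are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: heading depth is the only cue used to infer nesting.
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HierarchySketch(HTMLParser):
    """Collects character data and nests it under inferred sections."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        # Stack of (heading_level, xml_element); the root sits at level 0.
        self.stack = [(0, self.root)]
        self.pending_heading = None  # level of the heading being read, if any
        self.heading_text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self.pending_heading = HEADING_LEVELS[tag]
            self.heading_text = []
        # Every other HTML element is ignored: it carries display semantics only.

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS and self.pending_heading is not None:
            level = self.pending_heading
            # Close every open section at the same or a deeper level.
            while self.stack[-1][0] >= level:
                self.stack.pop()
            section = ET.SubElement(self.stack[-1][1], "section")
            section.set("title", " ".join("".join(self.heading_text).split()))
            self.stack.append((level, section))
            self.pending_heading = None

    def handle_data(self, data):
        if self.pending_heading is not None:
            self.heading_text.append(data)
            return
        text = " ".join(data.split())  # normalize whitespace
        if text:
            # Attach the character data to the innermost open section.
            ET.SubElement(self.stack[-1][1], "text").text = text


def html_to_xml(html):
    """Return an XML string holding the data and inferred hierarchy of `html`."""
    parser = HierarchySketch()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")
```

Calling `html_to_xml("<h1>Books</h1><p>Intro</p><h2>XML</h2><p>Basics</p>")` yields a `document` element in which the "XML" section is nested inside the "Books" section and each paragraph's text sits under the section it follows. A real converter would need further heuristics, for example for tables, lists, and pages that do not use headings at all.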
