Converting Web pages into well-formed XML documents

The work presented here is part of a Web mining agent (WMA) system under development at our Multimedia and Mobile Agent Research Laboratory. The system automatically extracts specific information from Web pages and formats it for further use. This requires resolving problems related to the disorganized nature of the Web, which may result from ill-formed HTML pages. The desired information is extracted by applying a sequence of filters to the Web documents, each filter having a specific role. This paper discusses the filter that converts Web documents into well-formed XML documents. The conversion involves three operations: (i) syntactic mapping of HTML to XML, (ii) resolving ambiguities introduced by HTML tagging rules, and (iii) handling errors caused by authors' improper use of HTML. The paper first presents an overview of the Web mining agent system, then gives the motivation for the conversion to XML, and finally discusses in detail the transformation applied to the Web documents.
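
The three operations listed above can be illustrated with a minimal sketch. It is not the filter described in the paper, whose implementation is not given here; it is an assumption-laden illustration built on Python's standard `html.parser` module. The names `HtmlToXml`, `html_to_xml`, `VOID`, and `IMPLICIT_CLOSE` are hypothetical, the rule tables cover only a handful of elements, and the input is assumed to have a single root element and attribute names that are already valid XML names.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Hypothetical, deliberately incomplete rule tables (assumptions of this sketch).
VOID = {"br", "hr", "img", "input", "meta", "link"}
# Opening the key element implicitly closes an open element of the listed kinds.
IMPLICIT_CLOSE = {
    "p": {"p"},
    "li": {"li"},
    "td": {"td", "th"},
    "th": {"td", "th"},
    "tr": {"tr", "td", "th"},
}


class HtmlToXml(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the XML output
        self.stack = []  # names of currently open elements

    def _close_top(self):
        self.out.append(f"</{self.stack.pop()}>")

    def handle_starttag(self, tag, attrs):
        # (ii) Ambiguity: an omitted end tag is implied by the next start tag.
        closes = IMPLICIT_CLOSE.get(tag, set())
        while self.stack and self.stack[-1] in closes:
            self._close_top()
        # (i) Syntactic mapping: quote every value, expand valueless
        # attributes, and keep only the first of any duplicated attribute.
        seen = {}
        for name, value in attrs:
            seen.setdefault(name, name if value is None else value)
        text = "".join(f" {n}={quoteattr(v)}" for n, v in seen.items())
        if tag in VOID:
            self.out.append(f"<{tag}{text}/>")  # empty-element syntax
        else:
            self.out.append(f"<{tag}{text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # (iii) Errors: ignore stray end tags; on misnesting, close every
        # element opened since the matching start tag.
        if tag not in self.stack:
            return
        while self.stack:
            top = self.stack[-1]
            self._close_top()
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # escape &, < and > in text


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    while parser.stack:  # (iii) close anything left open at end of input
        parser._close_top()
    return "".join(parser.out)


if __name__ == "__main__":
    xml = html_to_xml("<html><body><p>One<p>Two <b>bold<br></p></BODY>")
    print(xml)
    ET.fromstring(xml)  # raises ParseError if the output is not well-formed
```

The parser lower-cases tag and attribute names, which handles the case-sensitivity part of the syntactic mapping. Comments and the document type declaration are simply dropped in this sketch; a production filter would need fuller rule tables and a policy for them.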
