Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis

Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically annotating HTML documents with semantic labels. Exploiting a key observation that semantically related items exhibit consistency in presentation style as well as spatial locality in template-based content-rich HTML documents, we have developed a novel framework for automatically partitioning such documents into semantic structures. Our framework tightly couples structural analysis of documents with semantic analysis incorporating domain ontologies and lexical databases such as WordNet. We present experimental evidence of the effectiveness of our techniques on a large collection of HTML documents from various news portals.
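
The key observation above, that semantically related items share a presentation style and sit close together in the document, can be illustrated with a small sketch. This is not the framework described in the paper: it assumes that the root-to-leaf tag path of a text node is an adequate stand-in for its presentation style, and that adjacency in document order is an adequate stand-in for spatial locality. The names `StylePathCollector` and `partition_by_style` are hypothetical, and the ontology- and WordNet-based semantic analysis is not modeled at all.

```python
from html.parser import HTMLParser
from itertools import groupby

# Elements that never take a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class StylePathCollector(HTMLParser):
    """Collects a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Assumption: the tag path approximates presentation style.
            self.items.append(("/".join(self.stack), text))


def partition_by_style(html):
    """Group consecutive text nodes that share the same tag path."""
    collector = StylePathCollector()
    collector.feed(html)
    collector.close()
    return [
        (path, [text for _, text in group])
        for path, group in groupby(collector.items, key=lambda item: item[0])
    ]


SAMPLE = (
    "<html><body>"
    "<h2>Sports</h2><ul><li><a>Game one</a></li><li><a>Game two</a></li></ul>"
    "<h2>World</h2><ul><li><a>Summit</a></li></ul>"
    "</body></html>"
)

for path, texts in partition_by_style(SAMPLE):
    print(path, texts)
```

On the sample page, each heading forms its own partition and the links that follow it are grouped together, which mirrors the section-and-items layout typical of template-based news portals. A fuller treatment would also need rendered style attributes and a semantic labeling step, neither of which this sketch attempts.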
