Annotating Structured Data of the Deep Web

An increasing number of databases have become Web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep Web data collection and comparison shopping, they need to be extracted and assigned meaningful labels. In this paper, we present a multi-annotator approach that first aligns the data units into groups such that the data units in the same group have the same semantics. Then, for each group, we annotate it from different aspects and aggregate the resulting annotations to predict a final label. An annotation wrapper for the search site is then constructed automatically and can be used to annotate new result pages from the same site. Our experiments indicate that the proposed approach is highly effective.
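The annotate-then-aggregate step can be illustrated with a minimal sketch. It assumes the data units have already been aligned into groups, and everything named here is hypothetical: the three toy annotators, their success rates, and the noisy-OR style combination rule are illustrative assumptions, not the annotators or the aggregation model used in the paper.

```python
from collections import defaultdict
from typing import Callable, List, Optional, Tuple

# A group holds data units (plain strings) assumed to share the same semantics.
Group = List[str]
# An annotator proposes a label for a group, or None when it cannot decide.
Annotator = Callable[[Group], Optional[str]]


def price_annotator(group: Group) -> Optional[str]:
    """Hypothetical annotator: every unit looks like a dollar amount."""
    return "Price" if group and all(u.startswith("$") for u in group) else None


def year_annotator(group: Group) -> Optional[str]:
    """Hypothetical annotator: every unit is a four-digit number."""
    return "Year" if group and all(u.isdigit() and len(u) == 4 for u in group) else None


def prefix_annotator(group: Group) -> Optional[str]:
    """Hypothetical annotator: all units share one 'Label: value' prefix."""
    if not group or any(":" not in u for u in group):
        return None
    prefixes = {u.split(":", 1)[0].strip() for u in group}
    return prefixes.pop() if len(prefixes) == 1 else None


# Each annotator is paired with an assumed success rate (illustrative numbers).
ANNOTATORS: List[Tuple[Annotator, float]] = [
    (price_annotator, 0.9),
    (year_annotator, 0.7),
    (prefix_annotator, 0.8),
]


def aggregate(group: Group, annotators: List[Tuple[Annotator, float]]) -> Optional[str]:
    """Combine annotator votes per label as 1 - prod(1 - rate) and keep the best label."""
    miss = defaultdict(lambda: 1.0)  # probability that every supporter of a label is wrong
    for annotator, rate in annotators:
        label = annotator(group)
        if label is not None:
            miss[label] *= 1.0 - rate
    if not miss:
        return None  # no annotator could label this group
    return max(miss, key=lambda label: 1.0 - miss[label])


def annotate_groups(groups: List[Group]) -> List[Optional[str]]:
    """Predict one final label per aligned group."""
    return [aggregate(g, ANNOTATORS) for g in groups]


groups = [
    ["$12.99", "$8.50"],
    ["1999", "2004"],
    ["Publisher: Wiley", "Publisher: MIT Press"],
    ["hardcover", "in stock"],
]
labels = annotate_groups(groups)
```

In this sketch a group that no annotator recognizes keeps the label `None`, which leaves it unlabeled rather than forcing a guess.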
