Learning to Harvest Information for the Semantic Web

In this paper we describe a methodology for harvesting information from large distributed repositories (e.g. large Web sites) with minimal user intervention. The methodology combines information extraction, information integration and machine learning techniques. Learning is seeded by extracting information from structured sources (e.g. databases and digital libraries) or from a user-defined lexicon. The retrieved information is then used to partially annotate documents. The annotated documents bootstrap learning for simple Information Extraction (IE) methodologies, which in turn produce further annotations; these are used to annotate more documents, which then train more complex IE engines, and so on. We describe the methodology and its implementation in the Armadillo system, compare it with the current state of the art, and describe the details of an implemented application. Finally, we draw some conclusions and highlight challenges and future work.
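
The seed, annotate, learn, extract cycle described in the abstract can be illustrated with a deliberately small sketch. This is not the Armadillo implementation: the function names (`annotate`, `learn_patterns`, `extract`, `bootstrap`), the "word immediately before a known term" cue, and the capitalised-token pattern are all assumptions made for illustration, standing in for the much richer IE learners the paper refers to.

```python
import re

def annotate(docs, lexicon):
    """Mark every occurrence of a known lexicon term: (doc index, start, end, term)."""
    spans = []
    for i, doc in enumerate(docs):
        for term in lexicon:
            for m in re.finditer(re.escape(term), doc):
                spans.append((i, m.start(), m.end(), term))
    return spans

def learn_patterns(docs, spans, min_support=2):
    """Toy 'learner' (assumption): keep the word preceding an annotation
    as a cue if it precedes at least min_support annotations."""
    counts = {}
    for i, start, _end, _term in spans:
        left = docs[i][:start].split()
        if left:
            cue = left[-1]
            counts[cue] = counts.get(cue, 0) + 1
    return {cue for cue, n in counts.items() if n >= min_support}

def extract(docs, cues):
    """Apply learned cues: collect runs of capitalised words that follow a cue."""
    found = set()
    for doc in docs:
        for cue in cues:
            pattern = re.escape(cue) + r"\s+([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
            for m in re.finditer(pattern, doc):
                found.add(m.group(1))
    return found

def bootstrap(docs, seed, max_rounds=5):
    """Grow the lexicon from a seed until a round adds nothing new."""
    lexicon = set(seed)
    for _ in range(max_rounds):
        spans = annotate(docs, lexicon)       # partial annotation from known terms
        cues = learn_patterns(docs, spans)    # train a simple extractor on it
        new = extract(docs, cues) - lexicon   # extractor proposes new terms
        if not new:
            break
        lexicon |= new                        # new terms seed the next round
    return lexicon
```

Given three sentences that each mention a person after "Prof." and a seed lexicon containing two of those names, the loop would learn "Prof." as a cue and add the third name. A real system would also need to check candidates against other sources before accepting them, which this sketch omits.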
