iASA: Learning to Annotate the Semantic Web

With the advent of the Semantic Web, there is a great need to upgrade existing web content to semantic web content. This can be accomplished through semantic annotation. Unfortunately, manual annotation is tedious, time-consuming, and error-prone. In this paper, we propose a tool, called iASA, that learns to automatically annotate web documents according to an ontology. iASA combines information extraction (specifically, the Similarity-based Rule Learner, SRL) with machine learning techniques. Using linguistic knowledge and an optimal dynamic window size, SRL produces annotation rules of better quality than comparable semantic annotation systems. Similarity-based learning efficiently reduces the search space by avoiding pseudo rule generalization. In the annotation phase, iASA exploits ontology knowledge to refine the annotations it proposes. Moreover, our annotation algorithm uses machine learning methods to select the correct instances and to predict missing ones. Finally, iASA provides an explanation component that explains the behavior of the learner and the annotator to the user. These explanations help users understand the rule induction and annotation processes, so that they can quickly correct rules and annotations. Experimental results show that iASA can reach high accuracy quickly.
