AllRight: Automatic Ontology Instantiation from Tabular Web Documents

Manually instantiating an ontology with high-quality and up-to-date instance information is both time-consuming and prone to error. Automatic ontology instantiation from Web sources is one possible solution to this problem: it aims at the computer-supported population of an ontology by exploiting the (redundant) information available on the Web. In this paper we present ALLRIGHT, a comprehensive ontology instantiation system. In particular, the techniques implemented in ALLRIGHT are designed for application scenarios in which the desired instance information is given in the form of tables and for which existing Information Extraction (IE) approaches based on statistical or natural language processing methods are not directly applicable. Within ALLRIGHT, we have therefore developed new techniques for dealing with tabular instance data and combined these techniques with existing methods. The system supports all necessary steps for ontology instantiation, i.e. web crawling, name extraction, document clustering, and fact extraction and validation. ALLRIGHT has been successfully evaluated in the popular domains of digital cameras and notebooks, yielding an accuracy of about eighty percent for the extracted facts given only a very limited amount of seed knowledge.
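The abstract does not spell out how tabular facts are extracted or validated, so the following is only a minimal sketch of the general idea of exploiting redundancy across Web pages, not ALLRIGHT's actual method. It assumes that specifications appear as two-column HTML table rows (attribute, value) and that a simple majority vote with a minimum support threshold decides which value to keep. The names `TwoColumnTableParser`, `extract_pairs`, `validate_by_vote`, and `min_support` are hypothetical and introduced purely for illustration.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TwoColumnTableParser(HTMLParser):
    """Collects (attribute, value) pairs from table rows with exactly two cells."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Normalise whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Assumption: only rows with two non-empty cells describe a fact.
            if len(self._row) == 2 and all(self._row):
                self.pairs.append((self._row[0].lower(), self._row[1].lower()))
            self._row = None


def extract_pairs(html):
    """Return the (attribute, value) pairs found in one HTML document."""
    parser = TwoColumnTableParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def validate_by_vote(documents, min_support=2):
    """Keep, per attribute, the most frequent value if enough pages agree on it."""
    votes = defaultdict(Counter)
    for html in documents:
        # Count each (attribute, value) pair at most once per page.
        for attribute, value in set(extract_pairs(html)):
            votes[attribute][value] += 1
    facts = {}
    for attribute, counter in votes.items():
        value, support = counter.most_common(1)[0]
        if support >= min_support:
            facts[attribute] = value
    return facts


# Three hypothetical pages describing the same (made-up) camera.
pages = [
    "<table><tr><td>Resolution</td><td>10 MP</td></tr>"
    "<tr><td>Zoom</td><td>3x</td></tr></table>",
    "<table><tr><th>Resolution</th><td>10 MP</td></tr>"
    "<tr><td>Zoom</td><td>4x</td></tr></table>",
    "<table><tr><td>Resolution</td><td>10 MP</td></tr>"
    "<tr><td>Weight</td><td>200 g</td></tr></table>",
]
facts = validate_by_vote(pages)
print(facts)
```

In this toy input only the resolution is confirmed by at least two pages; the conflicting zoom values and the single weight value fall below the support threshold and are discarded. A real system would additionally have to normalise attribute names and units before voting.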
