A Confidence-Weighted Metric for Unsupervised Ontology Population from Web Texts

Automatically constructing and populating domain ontologies remains difficult for knowledge engineers, mainly because of the well-known knowledge acquisition bottleneck. In this paper, we attempt to alleviate this problem by proposing an unsupervised approach that uses the web as a large corpus and applies linguistic patterns to identify and extract ontological class instances. The prototype implementation uses shallow syntactic parsing to resolve ambiguities. In addition, we propose a confidence-weighted metric that combines different versions of the classical pointwise mutual information (PMI) metric, WordNet similarity measures, and heuristics into a final confidence score, with the aim of improving the ranking of the candidate instances retrieved by the system. We conducted preliminary experiments comparing the proposed confidence metric against several versions of the PMI metric. The results for the final ranking of the candidate instances are promising, with a gain in precision of up to 24%.
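
As a rough illustration of the PMI component only, the sketch below computes the textbook form of pointwise mutual information from co-occurrence counts and ranks candidate instances by it. It is not the confidence-weighted metric proposed in the paper, and it omits the WordNet similarity measures and heuristics. The function names, the hit-count inputs, and the corpus size `total_docs` are hypothetical and assumed for illustration.

```python
import math


def pmi(joint_hits, instance_hits, class_hits, total_docs):
    """Textbook PMI: log2( p(instance, class) / (p(instance) * p(class)) ).

    All arguments are hypothetical counts, e.g. page hits for a query.
    Returns -inf when any count is zero, so such candidates rank last.
    """
    if min(joint_hits, instance_hits, class_hits) <= 0:
        return float("-inf")
    p_joint = joint_hits / total_docs
    p_instance = instance_hits / total_docs
    p_class = class_hits / total_docs
    return math.log2(p_joint / (p_instance * p_class))


def rank_candidates(counts, class_hits, total_docs):
    """Rank candidates by PMI with the class, highest score first.

    counts maps each candidate name to (joint_hits, instance_hits).
    """
    scored = [
        (name, pmi(joint, inst, class_hits, total_docs))
        for name, (joint, inst) in counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up counts for a class such as "city"; not data from the paper.
    example = {"Paris": (50, 100), "banana": (1, 100)}
    for name, score in rank_candidates(example, class_hits=200, total_docs=10_000):
        print(f"{name}: {score:.2f}")
```

In this toy setting, a candidate that co-occurs with the class more often than chance would predict receives a positive score, and one that co-occurs less often receives a negative score. That ordering is the signal a PMI-based ranking relies on.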
