Toward Completeness in Concept Extraction and Classification

Many algorithms extract terms from text together with some kind of taxonomic classification (is-a) link. However, the general approaches used today, and especially the methods used to evaluate their results, have serious shortcomings. Harvesting without focusing on a specific conceptual area may deliver large numbers of terms, but they are scattered over an immense concept space, which makes Recall judgments impossible. As for Precision, judging the correctness of terms and their individual classification links in isolation may yield high scores, but it does not help with the eventual assembly of terms into a single coherent taxonomy. Furthermore, since there is no correct and complete gold standard to measure against, most work invents some ad hoc evaluation measure. We present an algorithm that is more precise and complete than previous ones at identifying, from web text, just those concepts 'below' a given seed term. Comparing the results to WordNet, we find that the algorithm misses some terms, but also that it learns many new terms not in WordNet and classifies them in ways that are acceptable to humans but different from WordNet.
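The comparison against a reference resource such as WordNet can be pictured as a set comparison between the harvested terms and the reference terms found below the same seed. The following is only a minimal sketch under that assumption, not the evaluation procedure used in the paper. The function name `compare_to_reference` and the toy term lists are hypothetical and made up for illustration.

```python
def compare_to_reference(harvested, reference):
    """Compare harvested terms with a reference term set.

    Both arguments are iterables of strings. Terms are lowercased
    before comparison (an assumption made for this sketch).
    """
    harvested = {t.lower() for t in harvested}
    reference = {t.lower() for t in reference}
    shared = harvested & reference
    return {
        "shared": shared,                 # found by both
        "missed": reference - harvested,  # in the reference only
        "new": harvested - reference,     # harvested, absent from the reference
        # Recall relative to the reference only; the reference itself
        # may be incomplete, so this is not a true Recall figure.
        "recall_vs_reference": len(shared) / len(reference) if reference else 0.0,
    }


# Toy, made-up data for illustration only.
harvested_terms = {"lion", "tiger", "liger"}
reference_terms = {"lion", "tiger", "jaguar", "leopard"}
report = compare_to_reference(harvested_terms, reference_terms)
```

Terms in the `new` set are not necessarily errors. As the abstract notes, many of them may be valid terms that the reference simply lacks, so they would still need human judgment.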
