Unsupervised Learning of Coherent and General Semantic Classes for Entity Aggregates

This paper addresses the task of semantic class learning by introducing a new methodology to identify the set of semantic classes underlying an aggregate of instances (i.e., a set of nominal phrases observed in a particular semantic role in a collection of text documents). The aim is to identify a set of classes that are semantically coherent (i.e., interpretable) and general enough to accurately describe the full extension that the set of instances is intended to represent. The learned classes are then used to devise a generative model for entity categorization tasks such as semantic class induction. The proposed methods are completely unsupervised and rely on an unlabeled, open-domain collection of text documents as the source of background knowledge. We demonstrate our proposal on a collection of news stories; specifically, we model the set of classes underlying the predicate arguments in a Proposition Store built from the news. Our experiments show significant improvements over a baseline generative model of entities based on latent classes defined by means of Hierarchical Dirichlet Processes.
