A Supervised Method of Feature Weighting for Measuring Semantic Relatedness

The clustering of related words is crucial for a variety of Natural Language Processing applications. Many known techniques of word clustering use the context of a word to determine its meaning: words which frequently appear in similar contexts are assumed to have similar meanings. Word clustering usually weights contexts by some measure of their importance. One of the most popular measures is Pointwise Mutual Information. It increases the weight of contexts where a word appears regularly but other words do not, and decreases the weight of contexts where many words may appear. Essentially, it is unsupervised feature weighting. We present a method of supervised feature weighting. It identifies contexts shared by pairs of words known to be semantically related or unrelated, and then uses Pointwise Mutual Information to weight these contexts by how well they indicate closely related words. We use Roget's Thesaurus as a source of training and evaluation data. This work is a step towards adding new terms to Roget's Thesaurus automatically, and doing so with high confidence.
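
To illustrate the unsupervised weighting described above, here is a minimal sketch of Pointwise Mutual Information computed over (word, context) co-occurrences, using PMI(w, c) = log2( P(w, c) / (P(w) P(c)) ). The function name `pmi_weights` and the toy observations are assumptions made for illustration only; this is not the authors' implementation, and it does not cover the supervised variant, which the abstract does not specify in enough detail to reproduce.

```python
import math
from collections import Counter


def pmi_weights(pairs):
    """Return PMI for every observed (word, context) pair.

    `pairs` is an iterable of (word, context) co-occurrence observations.
    Probabilities are plain relative frequencies (an illustrative choice).
    """
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)
    word_counts = Counter(word for word, _ in pairs)
    context_counts = Counter(context for _, context in pairs)

    weights = {}
    for (word, context), n in joint.items():
        # P(w, c) / (P(w) * P(c)) simplifies to n * total / (count(w) * count(c))
        ratio = n * total / (word_counts[word] * context_counts[context])
        weights[(word, context)] = math.log2(ratio)
    return weights


# Hypothetical toy data: "eat" co-occurs with every word, "purr" only with "cat".
observations = [
    ("cat", "purr"), ("cat", "purr"), ("cat", "eat"),
    ("dog", "bark"), ("dog", "bark"), ("dog", "eat"),
    ("bird", "eat"), ("bird", "sing"),
]
weights = pmi_weights(observations)
```

In this toy data the context "eat" is shared by all three words, so its weight for "cat" falls below that of "purr", which appears only with "cat". This mirrors the behaviour the abstract attributes to Pointwise Mutual Information.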
