Incorporating Dictionary and Corpus Information into a Context Vector Measure of Semantic Relatedness

This is to certify that I have examined this copy of master's thesis by SIDDHARTH PATWARDHAN and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Acknowledgments

I would like to take this opportunity to thank a number of people, without whose support and encouragement this thesis would not have been possible. Firstly, I would like to thank my advisor Dr. Pedersen for being so thorough and patient, and for seeing me through this research till the end. I would like to thank my committee members, Dr. Gallian and Dr. Turner, for going over the thesis so carefully and for their insightful suggestions. I would also like to thank Bano, whose work we built upon and who was full of ideas throughout. I thank my fellow NLP group members – Saif, Bridget and Amruta – for their ideas and suggestions and my colleague Navdeep for proofreading and providing her thoughts on the thesis. I am grateful to Jason Rennie for providing a wonderful interface to WordNet and to Mona Diab for her feedback on the measures. I am also grateful to Diana Inkpen for her insights on the Vector measure.

Abstract

Humans are able to judge the relatedness of words (concepts) relatively easily, and are often in general agreement as to how related two words are. For example, few would disagree that "pencil" is more related to "paper" than it is to "boat". Miller and Charles (1991) attribute this human perception of relatedness to the overlap of contextual representations of words in the human mind, and there is at least some understanding of how humans are able to perform this task. However, it remains an open question as to how to create automatic computational methods that assign relatedness values or scores to pairs of concepts.
A number of measures of relatedness have been proposed, most of them relying on information taken from the lexical database WordNet, and possibly augmented with corpus-based statistics. In this thesis we study a number of such measures, and offer various refinements to those proposed. We then compare these measures along with three others in the context of a human relatedness study and in word sense disambiguation experiments. We find that the measures of Jiang and Conrath (1997) and Banerjee and Pedersen …

[1] Eneko Agirre, et al. Word Sense Disambiguation using Conceptual Density, 1996, COLING.

[2] Hinrich Schütze, et al. Automatic Word Sense Discrimination, 1998, Comput. Linguistics.

[3] George A. Miller, et al. A Semantic Concordance, 1993, HLT.

[4] W. Nelson Francis, et al. Frequency Analysis of English Usage: Lexicon and Grammar, 1983.

[5] Graeme Hirst, et al. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures, 2004.

[6] Yoshihiko Nitta, et al. Co-Occurrence Vectors From Corpora vs. Distance Vectors From Dictionaries, 1994, COLING.

[7] Ted Pedersen, et al. Using Measures of Semantic Relatedness for Word Sense Disambiguation, 2003, CICLing.

[8] Christiane Fellbaum, et al. Book Reviews: WordNet: An Electronic Lexical Database, 1999, CL.

[9] Ted Pedersen, et al. Extended Gloss Overlaps as a Measure of Semantic Relatedness, 2003, IJCAI.

[10] Michael E. Lesk, et al. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, 1986, SIGDOC '86.

[11] G. Miller, et al. Contextual correlates of semantic similarity, 1991.

[12] John B. Goodenough, et al. Contextual correlates of synonymy, 1965, CACM.

[13] Hideki Kozima, et al. Similarity between Words Computed by Spreading Activation on an English Dictionary, 1993, EACL.

[14] Christiane Fellbaum, et al. Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms, 1998.

[15] Roy Rada, et al. Development and application of a metric on semantic nets, 1989, IEEE Trans. Syst. Man Cybern.

[16] Graeme Hirst, et al. Automatic Sense Disambiguation of the Near-Synonyms in a Dictionary Entry, 2003, CICLing.

[17] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[18] Alexander Budanitsky, et al. Lexical Semantic Relatedness and Its Application in Natural Language Processing, 1999.

[19] David W. Conrath, et al. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy, 1997, ROCLING/IJCLCLP.

[20] Philip Resnik, et al. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, 1995, IJCAI.

[21] Christiane Fellbaum, et al. Combining Local Context and Wordnet Similarity for Word Sense Identification, 1998.

[22] Michael Sussna, et al. Word sense disambiguation for free-text indexing using a massive semantic network, 1993, CIKM '93.

[23] Ted Pedersen, et al. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet, 2002, CICLing.

[24] David Yarowsky, et al. One Sense per Collocation, 1993, HLT.