Roget's Thesaurus as a Lexical Resource for Natural Language Processing

WordNet proved that it is possible to construct a large-scale electronic lexical database on the principles of lexical semantics. It has been accepted and used extensively by computational linguists ever since it was released. Inspired by WordNet's success, we propose as an alternative a similar resource, based on the 1987 Penguin edition of Roget's Thesaurus of English Words and Phrases. Peter Mark Roget published his first Thesaurus over 150 years ago. Countless writers, orators and students of the English language have used it. Computational linguists have employed Roget's in Natural Language Processing for almost 50 years, but they have hesitated to adopt it because no proper machine-tractable version was available. This dissertation presents an implementation of a machine-tractable version of the 1987 Penguin edition of Roget's Thesaurus, the first implementation of its kind to use an entire current edition. It explains the steps necessary for taking a machine-readable file and transforming it into a tractable system: converting the lexical material into a format that can be more easily exploited, identifying data structures, and designing classes to computerize the Thesaurus. Roget's organization is studied in detail and contrasted with WordNet's. We show two applications of the computerized Thesaurus: computing semantic similarity between words and phrases, and building lexical chains in a text. The experiments are performed using well-known benchmarks, and the results are compared to those of other systems that use Roget's, WordNet and statistical techniques. Roget's has turned out to be an excellent resource for measuring semantic similarity; lexical chains are easily built but more difficult to evaluate. We also explain ways in which Roget's Thesaurus and WordNet can be combined.
