Automatic discovery of word semantic relations using paraphrase alignment and distributional lexical semantics analysis

Thesauri, which list the most salient semantic relations between words, have mostly been compiled manually. The inclusion of an entry therefore depends on the subjective decision of the lexicographer, and as a consequence these resources are usually incomplete. In this paper, we propose an unsupervised methodology to automatically discover pairs of semantically related words by highlighting their local environment and evaluating their semantic similarity in local and global semantic spaces. This proposal differs from previous research in that it tries to take the best of two different methodologies, namely semantic space models and information extraction models. In particular, it can be applied to extract close semantic relations, it limits the search space to a few highly probable options, and it is unsupervised.
