Distinguishing antonymy, synonymy and hypernymy with distributional and distributed vector representations and neural networks

In the last decade, computational models that distinguish semantic relations have become crucial for many applications in Natural Language Processing (NLP), such as machine translation, question answering, and sentiment analysis. These models typically distinguish semantic relations either by representing semantically related words as vectors in a vector space or by using neural networks to classify the relations. This thesis focuses on improving such computational models. Specifically, its goal is to address the tasks of distinguishing antonymy, synonymy, and hypernymy. For the task of distinguishing antonymy and synonymy, we propose two approaches. In the first approach, we improve both families of word vector representations: distributional and distributed. For distributional vector representations, we propose a novel weighted feature for constructing word vectors that relies on distributional lexical contrast, a feature capable of differentiating between antonymy and synonymy. For distributed vector representations, we propose a neural model that learns word vectors by integrating distributional lexical contrast into its objective function. The resulting word vectors can distinguish antonymy from synonymy and predict degrees of word similarity. In the second approach, we use lexico-syntactic patterns to classify antonymy and synonymy. To do so, we propose two pattern-based neural networks. The lexico-syntactic patterns are induced from syntactic parse trees and then encoded as vector representations by the neural networks. The two pattern-based neural networks improve performance over prior pattern-based methods.
For the task of distinguishing hypernymy, we propose a novel neural model that learns hierarchical embeddings for hypernymy detection and directionality. The hierarchical embeddings are learned according to two underlying aspects: (i) that the similarity of hypernymy is higher than the similarity of other relations, and (ii) that a distributional hierarchy is generated between hyponyms and hypernyms. The experimental results show that the hierarchical embeddings significantly outperform state-of-the-art word embeddings. To improve word embeddings for measuring semantic similarity and relatedness, we propose two neural models that learn word denoising embeddings by filtering noise from the original word embeddings without using any external resources. The two models receive the original word embeddings as input and learn denoising matrices that filter noise from them. The word denoising embeddings improve on the original word embeddings across tasks of semantic similarity and relatedness. Furthermore, we shift the focus of evaluation from English to Vietnamese. To that end, we introduce two novel datasets of (dis-)similarity and relatedness for Vietnamese. We then use computational models to verify the two datasets and to observe how well the models adapt to Vietnamese. The results show that the computational models behave similarly on the two Vietnamese datasets and on the corresponding English datasets.
