Learning Semantic Word Embeddings based on Ordinal Knowledge Constraints

In this paper, we propose a general framework to incorporate semantic knowledge into the popular data-driven learning process of word embeddings to improve the quality of them. Under this framework, we represent semantic knowledge as many ordinal ranking inequalities and formulate the learning of semantic word embeddings (SWE) as a constrained optimization problem, where the data-derived objective function is optimized subject to all ordinal knowledge inequality constraints extracted from available knowledge resources such as Thesaurus and WordNet. We have demonstrated that this constrained optimization problem can be efficiently solved by the stochastic gradient descent (SGD) algorithm, even for a large number of inequality constraints. Experimental results on four standard NLP tasks, including word similarity measure, sentence completion, name entity recognition, and the TOEFL synonym selection, have all demonstrated that the quality of learned word vectors can be significantly improved after semantic knowledge is incorporated as inequality constraints during the learning process of word embeddings.

[1]  Wei Lin,et al.  Revisiting Word Embedding for Contrasting Meaning , 2015, ACL.

[2]  Richard M. Schwartz,et al.  Fast and Robust Neural Network Joint Models for Statistical Machine Translation , 2014, ACL.

[3]  Omer Levy,et al.  Neural Word Embedding as Implicit Matrix Factorization , 2014, NIPS.

[4]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[5]  Yee Whye Teh,et al.  A fast and simple algorithm for training neural probabilistic language models , 2012, ICML.

[6]  Tie-Yan Liu,et al.  Knowledge-Powered Deep Learning for Word Embedding , 2014, ECML/PKDD.

[7]  James L. McClelland,et al.  Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations , 1986 .

[8]  John B. Goodenough,et al.  Contextual correlates of synonymy , 1965, CACM.

[9]  T. Landauer,et al.  A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .

[10]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[11]  James Champlin Fernald English Synonyms and Antonyms , 2009 .

[12]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[13]  Mark Dredze,et al.  Improving Lexical Embeddings with Semantic Knowledge , 2014, ACL.

[14]  Geoffrey Zweig Explicit Representation of Antonymy in Language Modeling , 2014 .

[15]  Zellig S. Harris,et al.  Distributional Structure , 1954 .

[16]  Gang Wang,et al.  RC-NET: A General Framework for Incorporating Knowledge into Word Representations , 2014, CIKM.

[17]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[18]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[19]  Christopher J. C. Burges,et al.  The Microsoft Research Sentence Completion Challenge , 2011 .

[20]  Patrick Pantel,et al.  From Frequency to Meaning: Vector Space Models of Semantics , 2010, J. Artif. Intell. Res..

[21]  Vysoké Učení,et al.  Statistical Language Models Based on Neural Networks , 2012 .

[22]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[23]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.

[24]  Martin Chodorow,et al.  Combining local context and wordnet similarity for word sense identification , 1998 .

[25]  Holger Schwenk,et al.  Continuous space language models , 2007, Comput. Speech Lang..

[26]  Ehud Rivlin,et al.  Placing search in context: the concept revisited , 2002, TOIS.

[27]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[28]  Christiane Fellbaum,et al.  Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[29]  G. Miller,et al.  Contextual correlates of semantic similarity , 1991 .

[30]  Sameer Singh,et al.  Low-Dimensional Embeddings of Logic , 2014, ACL 2014.

[31]  Yoshua Bengio,et al.  Word Representations: A Simple and General Method for Semi-Supervised Learning , 2010, ACL.

[32]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[33]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[34]  Geoffrey E. Hinton,et al.  Distributed Representations , 1986, The Philosophy of Artificial Intelligence.

[35]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.