Probabilistic FastText for Multi-Sense Word Embeddings

We introduce Probabilistic FastText, a new model for word embeddings that can capture multiple word senses, sub-word structure, and uncertainty information. In particular, we represent each word with a Gaussian mixture density, where the mean of a mixture component is given by the sum of its character n-gram vectors. This representation allows the model to share statistical strength across sub-word structures (e.g., Latin roots), producing accurate representations of rare, misspelled, or even unseen words. Moreover, each component of the mixture can capture a different word sense. Probabilistic FastText outperforms both FastText, which has no probabilistic model, and dictionary-level probabilistic embeddings, which do not incorporate subword structure, on several word-similarity benchmarks, including the English Rare Word dataset and foreign-language datasets. We also achieve state-of-the-art performance on benchmarks that measure the ability to discern different meanings. The proposed model is thus the first to achieve multi-sense representations while also enriching the semantics of rare words.
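
To make the representation concrete, the following is a minimal NumPy sketch of the core idea: each word is a K-component Gaussian mixture whose component means are sums of hashed character n-gram vectors, and word similarity is the inner product of the two mixture densities (the expected likelihood kernel). All names, the bucket count, the per-component n-gram tables, and the similarity routine are illustrative assumptions of this sketch, not the authors' implementation.

import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, FastText-style; the padded
    word itself is also included as a feature."""
    padded = "<" + word + ">"
    grams = [padded]
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

class GaussianMixtureEmbedding:
    """Each word is a K-component Gaussian mixture with spherical covariances;
    the mean of component k is the sum of that component's n-gram vectors."""

    def __init__(self, dim=50, n_components=2, n_buckets=10_000, seed=0):
        rng = np.random.default_rng(seed)
        self.n_buckets = n_buckets
        # One hashed n-gram table per mixture component (an assumption of this
        # sketch; production systems use ~2M buckets and a deterministic hash).
        self.ngram_vecs = rng.normal(scale=0.1,
                                     size=(n_components, n_buckets, dim))
        self.log_sigma2 = np.zeros(n_components)   # spherical variance per component
        self.weights = np.full(n_components, 1.0 / n_components)  # mixture weights

    def component_means(self, word):
        # Works for any string, so rare, misspelled, or unseen words still
        # receive means composed from their sub-word pieces.
        ids = [hash(g) % self.n_buckets for g in char_ngrams(word)]
        return self.ngram_vecs[:, ids, :].sum(axis=1)  # shape (K, dim)

def expected_likelihood_kernel(model, w1, w2):
    """Inner product of two Gaussian mixture densities: a weighted sum over
    component pairs of N(mu_i; mu_j, (s_i^2 + s_j^2) I)."""
    mu1, mu2 = model.component_means(w1), model.component_means(w2)
    s2 = np.exp(model.log_sigma2)
    dim = mu1.shape[1]
    total = 0.0
    for i in range(len(mu1)):
        for j in range(len(mu2)):
            var = s2[i] + s2[j]
            diff = mu1[i] - mu2[j]
            log_n = -0.5 * (dim * np.log(2 * np.pi * var) + diff @ diff / var)
            total += model.weights[i] * model.weights[j] * np.exp(log_n)
    return total

Because the kernel sums over all component pairs, a word's different senses (components) can each align with a different neighbor, which is what lets a single mixture representation cover polysemy; training in the paper optimizes, roughly, a margin loss on the log of this kind of density inner product, though the exact energy and the per-component parameter sharing here are simplified.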
