Batch IS NOT Heavy: Learning Word Representations From All Samples

Stochastic Gradient Descent (SGD) with negative sampling is the most prevalent approach for learning word representations. However, sampling-based methods are known to be biased, especially when the sampling distribution deviates from the true data distribution. Moreover, SGD suffers from dramatic fluctuations because each update is based on a single sample. In this work, we propose AllVec, which uses batch gradient learning to generate word representations from all training samples. Remarkably, the time complexity of AllVec remains at the same level as SGD: it is determined by the number of positive samples rather than by all samples. We evaluate AllVec on several benchmark tasks. Experiments show that AllVec outperforms sampling-based SGD methods with comparable efficiency, especially for small training corpora.
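
As a rough illustration of how a full-batch gradient over all word-context pairs can avoid quadratic cost, the sketch below computes the gradient of a weighted squared loss in which unobserved pairs carry a small uniform weight, and caches the sum of context outer products so that the all-pairs term costs only O(|V| k^2). The NumPy formulation, the variable names, and the uniform negative weight alpha are illustrative assumptions, not the paper's exact weighting scheme or implementation.

```python
import numpy as np

def allpair_gradient(U, V, pos, alpha=0.01):
    """Full-batch gradient w.r.t. word vectors U for a weighted squared loss
    over ALL word-context pairs.

    U: (num_words, k) word vectors; V: (num_contexts, k) context vectors.
    pos: dict mapping (w, c) -> (target, weight) for observed pairs only.
    alpha: uniform weight on unobserved (zero-target) pairs; an
           illustrative assumption.

    Cost is O(|pos| * k + |V| * k^2) instead of O(num_words * num_contexts * k),
    because the all-pairs part reduces to a k x k cache.
    """
    grad_U = np.zeros_like(U)

    # Part 1: alpha-weighted zero-target term over ALL pairs, via a k x k cache.
    S_v = V.T @ V                      # O(num_contexts * k^2)
    grad_U += 2.0 * alpha * U @ S_v    # O(num_words * k^2)

    # Part 2: correct the observed (positive) pairs only.
    for (w, c), (target, weight) in pos.items():
        pred = U[w] @ V[c]
        # Swap the alpha-weighted zero-target term for the true weighted term.
        grad_U[w] += 2.0 * (weight * (pred - target) - alpha * pred) * V[c]
    return grad_U
```

The same caching trick applies symmetrically to the gradient with respect to the context vectors, which is what keeps one full-batch update at roughly the cost of one SGD epoch over the positive samples.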
