GNEG: Graph-Based Negative Sampling for word2vec

Negative sampling is an important component of word2vec for learning distributed word representations. We hypothesize that taking into account global, corpus-level information and generating a different noise distribution for each target word better satisfies the requirements of negative examples for each training word than the original frequency-based distribution. To this end, we pre-compute word co-occurrence statistics from the corpus and apply network algorithms, such as random walks, to the resulting co-occurrence graph. We test this hypothesis through a set of experiments whose results show that our approach improves performance on the word analogy task by about 5% and on word similarity tasks by about 1% compared to the skip-gram negative sampling baseline.
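
The following is a minimal sketch of the general idea described in the abstract: build a co-occurrence graph from the corpus, run a short random walk on it, and use the resulting per-word distributions as word-specific noise distributions for negative sampling. All function names, parameters, and the dense-matrix implementation are illustrative assumptions and not the paper's actual implementation (which pre-computes statistics at corpus scale, e.g. with corpus2graph).

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=5):
    """Count symmetric word co-occurrences within a fixed window
    (a simplified stand-in for the corpus-level statistics the paper pre-computes)."""
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        ids = [idx[w] for w in sent if w in idx]
        for i, wi in enumerate(ids):
            for wj in ids[max(0, i - window):i]:
                counts[wi, wj] += 1
                counts[wj, wi] += 1
    return counts

def random_walk_noise(counts, steps=2, smoothing=1e-8):
    """Turn the co-occurrence graph into one noise distribution per target word
    by taking `steps` steps of a random walk on the row-normalized graph."""
    P = counts + smoothing                        # avoid all-zero rows
    P = P / P.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
    walk = np.linalg.matrix_power(P, steps)       # t-step walk probabilities
    return walk                                   # row i = noise distribution for word i

def sample_negatives(noise, target_id, k=5, rng=np.random.default_rng()):
    """Draw k negative samples for `target_id` from its word-specific noise
    distribution, instead of the global unigram^0.75 distribution of word2vec."""
    return rng.choice(noise.shape[1], size=k, p=noise[target_id])
```

In this sketch the only change relative to standard skip-gram negative sampling is where the negatives come from: each target word gets its own row of the random-walk matrix as its noise distribution, rather than all words sharing a single frequency-based distribution.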
