论文信息 - Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency

Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency

Noise Contrastive Estimation (NCE) is a powerful parameter estimation method for log-linear models, which avoids calculation of the partition function or its derivatives at each training step, a computationally demanding step in many cases. It is closely related to negative sampling methods, now widely used in NLP. This paper considers NCE-based estimation of conditional models. Conditional models are frequently encountered in practice; however there has not been a rigorous theoretical analysis of NCE in this setting, and we will argue there are subtle but important questions when generalizing NCE to the conditional case. In particular, we analyze two variants of NCE for conditional models: one based on a classification objective, the other based on a ranking objective. We show that the ranking-based variant of NCE gives consistent parameter estimates under weaker assumptions than the classification-based method; we analyze the statistical efficiency of the ranking-based and classification-based variants of NCE; finally we describe experiments on synthetic data and language modeling showing the effectiveness and tradeoffs of both methods.

Zhuang Ma | Michael Collins | Michael Collins | Zhuang Ma

[1] Ruslan Salakhutdinov,et al. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model , 2017, ICLR.

[2] Yonghui Wu,et al. Exploring the Limits of Language Modeling , 2016, ArXiv.

[3] Yee Whye Teh,et al. A fast and simple algorithm for training neural probabilistic language models , 2012, ICML.

[4] Aapo Hyvärinen,et al. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics , 2012, J. Mach. Learn. Res..

[5] Adam L. Berger,et al. A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[6] Andrew McCallum,et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[7] Wojciech Zaremba,et al. Recurrent Neural Network Regularization , 2014, ArXiv.

[8] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[9] Yoshua Bengio,et al. Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model , 2008, IEEE Transactions on Neural Networks.

[10] Beatrice Santorini,et al. Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[11] T. Ferguson. A Course in Large Sample Theory , 1996 .

[12] Omer Levy,et al. Neural Word Embedding as Implicit Matrix Factorization , 2014, NIPS.