Towards Competitive N-gram Smoothing

N-gram models remain a fundamental component of language modeling. In data-scarce regimes, they are a strong alternative to neural models. Recent work shows that, even when not used as-is, they can regularize neural models. Despite this success, the effectiveness of one of the best N-gram smoothing methods, the one suggested by Kneser and Ney (1995), is not fully understood. In the hopes of explaining this performance, we study it through the lens of competitive distribution estimation: the ability to perform as well as an oracle aware of further structure in the data. We first establish basic competitive properties of Kneser–Ney smoothing. We then investigate the nature of its backoff mechanism and show that it emerges from first principles, rather than being an assumption of the model. We do this by generalizing the Good–Turing estimator to the contextual setting. This exploration leads us to a powerful generalization of Kneser–Ney, which we conjecture to have even stronger competitive properties. Empirically, it significantly improves performance on language modeling, even matching feed-forward neural models. To show that the mechanisms at play are not restricted to language modeling, we demonstrate similar gains on the task of predicting attack types in the Global Terrorism Database.
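As background for the methods named above, the sketch below illustrates standard interpolated Kneser–Ney smoothing for bigrams (Kneser and Ney, 1995; Chen and Goodman, 1996): observed counts are absolutely discounted, and the freed probability mass is redistributed through a continuation distribution that counts distinct contexts rather than raw frequencies. This is a minimal illustration only, not the generalization studied in the paper; the function name kneser_ney_bigram and the fixed discount of 0.75 are illustrative choices.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram probabilities (illustrative sketch).

    Uses a single fixed discount and backs off to the continuation
    (distinct-context) unigram distribution; unseen words get probability 0.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))      # c(v, w)
    context_counts = Counter(tokens[:-1])           # c(v) = sum over w of c(v, w)
    followers = defaultdict(set)                    # distinct w observed after v
    predecessors = defaultdict(set)                 # distinct v observed before w
    for v, w in bigrams:
        followers[v].add(w)
        predecessors[w].add(v)
    num_bigram_types = len(bigrams)                 # number of distinct bigrams

    def prob(w, v):
        # Continuation probability: fraction of bigram types that end in w.
        p_cont = len(predecessors[w]) / num_bigram_types
        if context_counts[v] == 0:
            return p_cont                           # unseen context: back off fully
        discounted = max(bigrams[(v, w)] - discount, 0.0) / context_counts[v]
        backoff_weight = discount * len(followers[v]) / context_counts[v]
        return discounted + backoff_weight * p_cont

    return prob
```

For instance, with prob = kneser_ney_bigram("the cat sat on the mat the cat ran".split()), prob("cat", "the") mixes the discounted bigram count for "the cat" with the backoff weight for the context "the" times the continuation probability of "cat".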

[1] Wojciech Zaremba et al. Recurrent Neural Network Regularization, 2014, arXiv.

[2] Eric P. Xing et al. Language Modeling with Power Low Rank Ensembles, 2013, EMNLP.

[3] Ian R. Lane et al. Neural network language models for low resource languages, 2014, INTERSPEECH.

[4] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[5] Stanley F. Chen et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[6] Dietrich Braess et al. Bernstein polynomials and learning theory, 2004, J. Approx. Theory.

[7] Jeffrey Dean et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[8] Liam Paninski. Variational Minimax Estimation of Discrete Distributions under KL Loss, 2004, NIPS.

[9] Alon Orlitsky et al. The power of absolute discounting: all-dimensional distribution estimation, 2017, NIPS.

[10] Richard Socher et al. Regularizing and Optimizing LSTM Language Models, 2017, ICLR.

[11] Alon Orlitsky et al. On Learning Distributions from their Samples, 2015, COLT.

[12] Lukáš Burget et al. Recurrent neural network based language model, 2010, INTERSPEECH.

[13] Alon Orlitsky et al. Competitive Distribution Estimation: Why is Good-Turing Good, 2015, NIPS.

[14] Mari Ostendorf et al. A Sparse Plus Low-Rank Exponential Language Model for Limited Resource Scenarios, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15] Alon Orlitsky et al. Near-Optimal Smoothing of Structured Conditional Probability Matrices, 2016, NIPS.

[16] Samy Bengio et al. N-gram Language Modeling using Recurrent Neural Network Estimation, 2017, arXiv.

[17] Masaaki Nagata et al. Direct Output Connection for a High-Rank Language Model, 2018, EMNLP.

[18] Tomáš Mikolov. Statistical Language Models Based on Neural Networks, 2012, PhD thesis, Brno University of Technology.

[19] Mesrob I. Ohannessian et al. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications, 2014, arXiv:1412.8652.

[20] Yee Whye Teh et al. A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes, 2006, ACL.

[21] Ruslan Salakhutdinov et al. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, 2017, ICLR.

[22] A. Shapiro et al. National Consortium for the Study of Terrorism and Responses to Terrorism, 2010.

[23] Di He et al. FRAGE: Frequency-Agnostic Word Representation, 2018, NeurIPS.

[24] Hermann Ney et al. Improved backing-off for M-gram language modeling, 1995, ICASSP.

[25] Yiming Yang et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.

[26] Daniel Jurafsky et al. Data Noising as Smoothing in Neural Network Language Models, 2017, ICLR.

[27] Gregory Valiant et al. Instance Optimal Learning, 2015, arXiv.

[28] I. Good. The Population Frequencies of Species and the Estimation of Population Parameters, 1953.

[29] Munther A. Dahleh et al. Rare Probability Estimation under Regularly Varying Heavy Tails, 2012, COLT.