Near-Optimal Smoothing of Structured Conditional Probability Matrices

Exploiting the structure of a probabilistic model can significantly accelerate learning. Motivated by several recent applications, in particular bigram models in language processing, we consider learning low-rank conditional probability matrices under expected KL-risk. This choice makes smoothing, that is, the careful handling of low-probability elements, paramount. We derive an iterative algorithm that extends classical non-negative matrix factorization to naturally incorporate additive smoothing, and prove that it converges to the stationary points of a penalized empirical risk. We then derive sample-complexity bounds for the global minimizer of the penalized risk and show that its sample complexity is within a small factor of the optimum. This framework generalizes to more sophisticated smoothing techniques, including absolute discounting.
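
To make the abstract's recipe concrete, below is a minimal Python sketch of one way to combine additive (add-beta) smoothing with KL-objective NMF via the Lee-Seung multiplicative updates cited in the abstract. The function name, the hyperparameters beta, rank, and n_iters, and the final row renormalization are illustrative assumptions, not the paper's exact algorithm or its penalized-risk formulation.

import numpy as np

def smoothed_lowrank_bigram(counts, rank, beta=0.5, n_iters=200, seed=0, eps=1e-12):
    """Fit a rank-`rank` conditional probability matrix to bigram counts.

    counts[i, j] = number of times symbol j followed symbol i.
    Add-beta smoothing keeps every entry positive, so the KL objective
    stays finite even for unseen bigrams.
    """
    rng = np.random.default_rng(seed)
    V = counts + beta                        # additive smoothing of raw counts
    V = V / V.sum(axis=1, keepdims=True)     # smoothed empirical conditionals
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3         # per-context mixing weights
    H = rng.random((rank, m)) + 1e-3         # latent component distributions
    for _ in range(n_iters):
        # Lee & Seung (2000) multiplicative updates for the KL divergence
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / W.sum(axis=0, keepdims=True).T
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / H.sum(axis=1, keepdims=True).T
    P = W @ H
    return P / P.sum(axis=1, keepdims=True)  # project rows back to the simplex

For example, with a count matrix C built from a corpus, P = smoothed_lowrank_bigram(C, rank=50) returns rows approximating Pr(j | i); the smoothing guarantees unseen pairs receive positive probability, which is exactly what the expected KL-risk criterion demands.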
