Near-Optimal Smoothing of Structured Conditional Probability Matrices
[1] Lizhong Zheng, et al. Euclidean Information Theory, 2008, 2008 IEEE International Zurich Seminar on Communications.
[2] Prateek Jain, et al. Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization, 2013, SIAM J. Optim.
[3] Mesrob I. Ohannessian, et al. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications, 2014, arXiv:1412.8652.
[4] Thomas Hofmann, et al. Probabilistic Latent Semantic Indexing, 1999, SIGIR Forum.
[5] Omer Levy, et al. Neural Word Embedding as Implicit Matrix Factorization, 2014, NIPS.
[6] Joris Pelemans, et al. Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation, 2014, ArXiv.
[7] Stanley F. Chen, et al. An empirical study of smoothing techniques for language modeling, 1999.
[8] Anastasios Kyrillidis, et al. Dropping Convexity for Faster Semi-definite Optimization, 2015, COLT.
[9] H. Sebastian Seung, et al. Algorithms for Non-negative Matrix Factorization, 2000, NIPS.
[10] Aleks Jakulin, et al. Applying Discrete PCA in Data Analysis, 2004, UAI.
[11] Mari Ostendorf, et al. A Sparse Plus Low-Rank Exponential Language Model for Limited Resource Scenarios, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[12] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.
[13] Hermann Ney, et al. Improved backing-off for M-gram language modeling, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.
[14] Mari Ostendorf, et al. Low Rank Language Models for Small Training Sets, 2011, IEEE Signal Processing Letters.
[15] Eric P. Xing, et al. Language Modeling with Power Low Rank Ensembles, 2013, EMNLP.
[16] Sanjeev Arora, et al. Simple, Efficient, and Neural Algorithms for Sparse Coding, 2015, COLT.
[17] Erkki Oja, et al. Multiplicative Updates for Learning with Stochastic Matrices, 2013, SCIA.
[18] Tony Robinson, et al. Scaling recurrent neural network language models, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[19] Munther A. Dahleh, et al. Rare Probability Estimation under Regularly Varying Heavy Tails, 2012, COLT.
[20] Alon Orlitsky, et al. Optimal Probability Estimation with Applications to Prediction and Classification, 2013, COLT.
[21] Alon Orlitsky, et al. Competitive Distribution Estimation: Why is Good-Turing Good, 2015, NIPS.
[22] Thorsten Brants, et al. Large Language Models in Machine Translation, 2007, EMNLP.
[23] Vladimir Vapnik, et al. Statistical learning theory, 1998.
[24] Lukás Burget, et al. Extensions of recurrent neural network language model, 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[25] Gregory Valiant, et al. Instance Optimal Learning, 2015, ArXiv.
[26] Qingqing Huang, et al. Recovering Structured Probability Matrices, 2016, ITCS.
[27] Alon Orlitsky, et al. On Learning Distributions from their Samples, 2015, COLT.
[28] Santosh S. Vempala, et al. Latent semantic indexing: a probabilistic analysis, 1998, PODS '98.
[29] Naoki Abe, et al. Polynomial learnability of probabilistic concepts with respect to the Kullback-Leibler divergence, 1991, COLT '91.
[30] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.