The Maximum Entropy Relaxation Path

The relaxed maximum entropy problem is concerned with finding a probability distribution on a finite set that minimizes the relative entropy to a given prior distribution, while satisfying relaxed max-norm constraints with respect to a third, observed multinomial distribution. We study the entire relaxation path for this problem in detail. We establish the existence of the relaxation path and give a geometric description of it: the maximum entropy relaxation path admits a planar geometric description as an increasing, piecewise linear function of the inverse relaxation parameter. We derive fast algorithms for tracking the path. In various realistic settings, our algorithms require $O(n \log n)$ operations for probability distributions on $n$ points, making it possible to handle large problems. Once the path has been recovered, we show that, given a validation set, the family of admissible models is reduced from an infinite family to a small discrete set. We demonstrate the merits of our approach in experiments with synthetic data and discuss its potential for the estimation of compact $n$-gram language models.
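
Concretely, writing $\Delta_n$ for the probability simplex on $n$ points, $q$ for the prior distribution, $\hat{p}$ for the observed multinomial distribution, and $\nu \ge 0$ for the relaxation parameter (notation introduced here for illustration; the paper's own symbols may differ), the relaxed problem described above can be sketched as

$$
\min_{p \in \Delta_n} \; \mathrm{D}\!\left(p \,\|\, q\right)
\qquad \text{subject to} \qquad
\left\lVert p - \hat{p} \right\rVert_{\infty} \le \nu .
$$

At $\nu = 0$ the constraint pins $p$ to $\hat{p}$, while for sufficiently large $\nu$ the constraint becomes inactive and the minimizer is the prior $q$ itself; the relaxation path is the family of solutions $p^{\star}(\nu)$ traced out between these two extremes.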
