MEMe: An Accurate Maximum Entropy Method for Efficient Approximations in Large-Scale Machine Learning

Efficient approximation lies at the heart of large-scale machine learning problems. In this paper, we propose a novel, robust maximum entropy algorithm that can handle hundreds of moments and allows for computationally efficient approximations. We showcase the usefulness of the proposed method, show its equivalence to constrained Bayesian variational inference, and demonstrate its superiority over existing approaches in two applications: fast log determinant estimation and information-theoretic Bayesian optimisation.
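
For orientation, the textbook form of the moment-constrained maximum entropy problem, and the way a log determinant can be written as an expectation under an eigenvalue density, can be sketched as follows. The notation ($p$, $\mu_i$, $\alpha_i$, $m$, $A$, $n$, $\lambda_k$, $z$) is an assumption made for this illustration and is not necessarily the notation or the exact formulation used in the paper.

```latex
% Maximum entropy density under m moment constraints (\mu_0 = 1 enforces normalisation)
\begin{aligned}
  \max_{p}\; & S[p] = -\int p(x)\,\log p(x)\,\mathrm{d}x
  \quad \text{subject to} \quad
  \int x^{i}\,p(x)\,\mathrm{d}x = \mu_i, \qquad i = 0,\dots,m, \\[4pt]
  % Stationarity of the Lagrangian gives an exponential-family solution
  p(x) &= \exp\!\Big(-1 - \sum_{i=0}^{m} \alpha_i x^{i}\Big), \\[4pt]
  % The multipliers minimise a convex dual; its gradient recovers the constraints
  F(\alpha) &= \int \exp\!\Big(-1 - \sum_{i=0}^{m} \alpha_i x^{i}\Big)\,\mathrm{d}x
              + \sum_{i=0}^{m} \alpha_i \mu_i,
  \qquad
  \frac{\partial F}{\partial \alpha_j} = \mu_j - \int x^{j}\,p(x)\,\mathrm{d}x .
\end{aligned}

% Log determinant of a symmetric positive definite A \in \mathbb{R}^{n \times n}
% with eigenvalues \lambda_1,\dots,\lambda_n and empirical eigenvalue density p
\begin{aligned}
  \log\det A &= \sum_{k=1}^{n} \log \lambda_k = n\,\mathbb{E}_{p}[\log \lambda], \\[4pt]
  % Moments of p are normalised traces, which admit stochastic estimates
  \mu_i &= \frac{1}{n}\,\operatorname{tr}\!\big(A^{i}\big)
         = \frac{1}{n}\,\mathbb{E}_{z}\big[z^{\top} A^{i} z\big],
  \qquad \mathbb{E}\big[z z^{\top}\big] = I .
\end{aligned}
```

Under these assumptions, only matrix-vector products with $A$ are needed to estimate the moments $\mu_i$, and the maximum entropy density fitted to those moments stands in for the unknown eigenvalue density when evaluating $\mathbb{E}_{p}[\log \lambda]$. In practice the eigenvalues are usually rescaled to a bounded interval first so that the moments remain well behaved.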
