Coresets for Scalable Bayesian Logistic Regression

The use of Bayesian methods in large-scale data settings is attractive because of the rich hierarchical models, uncertainty quantification, and prior specification they provide. Standard Bayesian inference algorithms are computationally expensive, however, making their direct application to large datasets difficult or infeasible. Recent work on scaling Bayesian inference has focused on modifying the underlying algorithms to, for example, use only a random data subsample at each iteration. We leverage the insight that data is often redundant to instead obtain a weighted subset of the data (called a coreset) that is much smaller than the original dataset. We can then use this small coreset in any number of existing posterior inference algorithms without modification. In this paper, we develop an efficient coreset construction algorithm for Bayesian logistic regression models. We provide theoretical guarantees on the size and approximation quality of the coreset -- both for fixed, known datasets, and in expectation for a wide class of data generative models. Crucially, the proposed approach also permits efficient construction of the coreset in both streaming and parallel settings, with minimal additional effort. We demonstrate the efficacy of our approach on a number of synthetic and real-world datasets, and find that, in practice, the size of the coreset is independent of the original dataset size. Furthermore, constructing the coreset takes a negligible amount of time compared to that required to run MCMC on it.
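
To make the construction concrete, below is a minimal Python sketch of sensitivity-based coreset sampling in the spirit of the approach described above. It is illustrative only: the function name logistic_coreset, the use of randomly chosen cluster centers in place of a proper k-means++ clustering, and the heuristic sensitivity bound (1 + distance to the nearest center) are all simplifying assumptions for exposition, not the paper's actual bounds, which are tighter and carry guarantees.

    import numpy as np

    def logistic_coreset(Z, M, k=4, seed=0):
        """Sample a weighted coreset of M points from Z, whose rows are
        z_n = y_n * x_n (label-signed features), for logistic regression.

        Simplified sensitivity-based importance sampling: upper-bound each
        point's influence via its distance to a small set of cluster
        centers, sample M points with probability proportional to those
        bounds, and reweight so the coreset log-likelihood is an unbiased
        estimate of the full-data log-likelihood.
        """
        rng = np.random.default_rng(seed)
        N = Z.shape[0]

        # Crude cluster centers: a random subset stands in for k-means++.
        centers = Z[rng.choice(N, size=k, replace=False)]

        # Distance of each point to its nearest center, shape (N,).
        d = np.min(np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2),
                   axis=1)

        # Heuristic sensitivity bounds: points far from every cluster center
        # can influence the logistic likelihood more, so they receive larger
        # bounds and are sampled more often.
        sigma = 1.0 + d
        p = sigma / sigma.sum()

        # Importance-sample M indices and assign unbiasedness weights.
        idx = rng.choice(N, size=M, replace=True, p=p)
        w = 1.0 / (M * p[idx])
        return idx, w

Downstream, any standard inference algorithm would simply target the weighted log-likelihood, sum_m w_m * log sigmoid(theta^T z_m), plus the log-prior; the sampler itself needs no modification, which is the central appeal of the coreset approach.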
