Determinantal Point Processes for Mini-Batch Diversification

We study a mini-batch diversification scheme for stochastic gradient descent (SGD). While classical SGD relies on uniformly sampling data points to form a mini-batch, we propose a non-uniform sampling scheme based on the Determinantal Point Process (DPP). The DPP relies on a similarity measure between data points and assigns low probability to mini-batches that contain redundant data and higher probability to mini-batches with more diverse data. This simultaneously balances the data and leads to stochastic gradients with lower variance. We term this approach Diversified Mini-Batch SGD (DM-SGD). We show that regular SGD and a biased version of stratified sampling emerge as special cases. Furthermore, DM-SGD generalizes stratified sampling to cases where no discrete features exist to bin the data into groups. We show experimentally that our method results in more interpretable and diverse features in unsupervised setups, and in better classification accuracies in supervised setups.
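As a concrete illustration of the sampling step, the sketch below draws a fixed-size mini-batch from a k-DPP using the standard two-phase eigendecomposition sampler of Kulesza and Taskar. This is a minimal sketch, not the paper's implementation: the RBF similarity kernel, the batch size k=16, and the helper names (elementary_symmetric, sample_k_dpp) are illustrative assumptions, and the SGD update itself is omitted.

```python
import numpy as np

def elementary_symmetric(eigvals, k):
    """E[j, n] = e_j(lambda_1, ..., lambda_n): elementary symmetric
    polynomials of the kernel eigenvalues, needed for k-DPP sampling."""
    N = len(eigvals)
    E = np.zeros((k + 1, N + 1))
    E[0, :] = 1.0
    for j in range(1, k + 1):
        for n in range(1, N + 1):
            E[j, n] = E[j, n - 1] + eigvals[n - 1] * E[j - 1, n - 1]
    return E

def sample_k_dpp(L, k, rng):
    """Draw a size-k index set S with probability proportional to det(L_S),
    the principal minor of the similarity kernel L (Kulesza & Taskar, 2011)."""
    eigvals, eigvecs = np.linalg.eigh(L)
    E = elementary_symmetric(eigvals, k)

    # Phase 1: select k eigenvectors, each kept with the marginal
    # probability dictated by the elementary symmetric polynomials.
    idx, j = [], k
    for n in range(len(eigvals), 0, -1):
        if j == 0:
            break
        if rng.random() < eigvals[n - 1] * E[j - 1, n - 1] / E[j, n]:
            idx.append(n - 1)
            j -= 1
    V = eigvecs[:, idx]

    # Phase 2: pick one item per selected eigenvector; after each pick,
    # project the remaining eigenvectors orthogonal to that coordinate.
    batch = []
    while V.shape[1] > 0:
        probs = np.sum(V ** 2, axis=1)
        i = rng.choice(len(probs), p=probs / probs.sum())
        batch.append(i)
        col = np.argmax(np.abs(V[i, :]))   # column with a nonzero i-th entry
        v = V[:, col].copy()
        V = np.delete(V, col, axis=1)
        V -= np.outer(v, V[i, :] / v[i])   # zero out coordinate i
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)         # re-orthonormalize the rest
    return np.array(batch)

# Toy usage: one diverse mini-batch over random features with an RBF kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # feature vectors of the data
sq = np.sum(X ** 2, axis=1)
L = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
batch = sample_k_dpp(L, k=16, rng=rng)         # indices of the mini-batch
```

Exact sampling requires an O(N^3) eigendecomposition of the full kernel, so at scale one would substitute a low-rank (e.g. Nyström) approximation of L; the greedy projection in phase 2 is then cheap relative to the decomposition.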
