Online Variance Reduction for Stochastic Optimization

Modern stochastic optimization methods often rely on uniform sampling, which is agnostic to the underlying characteristics of the data. This can degrade convergence by yielding gradient estimates that suffer from high variance. A possible remedy is to employ non-uniform importance sampling techniques, which take the structure of the dataset into account. In this work, we investigate a recently proposed setting that poses variance reduction as an online optimization problem with bandit feedback. We devise a novel and efficient algorithm for this setting that finds a sequence of importance sampling distributions competitive with the best fixed distribution in hindsight, the first result of this kind. While we present our method for sampling datapoints, it naturally extends to selecting coordinates or even blocks thereof. Empirical evaluations underline the benefits of our method in several settings.

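As a rough illustration of why non-uniform sampling helps, the sketch below shows importance-weighted SGD on a toy least-squares problem: sampling datapoint i with probability p_i and rescaling its gradient by 1/(n p_i) keeps the stochastic gradient unbiased, while a well-chosen p reduces its variance relative to uniform sampling. The sampling heuristic used here (probabilities proportional to per-example norms) and all names (`importance_sgd_step`, `sq_grad`) are illustrative assumptions; the paper's contribution is an online algorithm that learns the sampling distribution from bandit feedback, which this sketch does not implement.

```python
import numpy as np


def importance_sgd_step(w, X, y, p, lr, grad_fn):
    """One SGD step with a non-uniform sampling distribution p over datapoints.

    Drawing index i with probability p[i] and scaling its gradient by
    1 / (n * p[i]) makes the estimate unbiased for the full-batch gradient.
    """
    n = len(y)
    i = np.random.choice(n, p=p)                # sample datapoint i ~ p
    g = grad_fn(w, X[i], y[i]) / (n * p[i])     # reweight to remove sampling bias
    return w - lr * g


def sq_grad(w, x_i, y_i):
    # Gradient of the squared loss (x_i^T w - y_i)^2 for a single example.
    return 2.0 * (x_i @ w - y_i) * x_i


# Toy usage: least squares with sampling probabilities proportional to
# per-example norms (a common fixed heuristic, not the learned distribution).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=200)
w = np.zeros(5)
p = np.linalg.norm(X, axis=1)
p /= p.sum()
for _ in range(1000):
    w = importance_sgd_step(w, X, y, p, lr=0.005, grad_fn=sq_grad)
```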