Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling

Machine learning, especially deep neural networks, has developed rapidly in fields, including computer vision, speech recognition, and reinforcement learning. Although minibatch stochastic gradient descent (SGD) is one of the most popular stochastic optimization methods for training deep networks, it shows a slow convergence rate due to the large noise in the gradient approximation. In this article, we attempt to remedy this problem by building a more efficient batch selection method based on typicality sampling, which reduces the error of gradient estimation in conventional minibatch SGD. We analyze the convergence rate of the resulting typical batch SGD algorithm and compare the convergence properties between the minibatch SGD and the algorithm. Experimental results demonstrate that our batch selection scheme works well and more complex minibatch SGD variants can benefit from the proposed batch selection strategy.

[1]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[3]  Xu Sun,et al.  Adaptive Gradient Methods with Dynamic Bound of Learning Rate , 2019, ICLR.

[4]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[5]  Tong Zhang,et al.  Accelerating Stochastic Gradient Descent using Predictive Variance Reduction , 2013, NIPS.

[6]  Ning Qian,et al.  On the momentum term in gradient descent learning algorithms , 1999, Neural Networks.

[7]  Shai Shalev-Shwartz,et al.  Stochastic dual coordinate ascent methods for regularized loss , 2012, J. Mach. Learn. Res..

[8]  Peter Richtárik,et al.  Importance Sampling for Minibatches , 2016, J. Mach. Learn. Res..

[9]  Yoshua Bengio,et al.  Variance Reduction in SGD by Distributed Importance Sampling , 2015, ArXiv.

[10]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[11]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[12]  Robert D. Tortora,et al.  Sampling: Design and Analysis , 2000 .

[13]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[14]  Daphne Koller,et al.  Self-Paced Learning for Latent Variable Models , 2010, NIPS.

[15]  Bo Pang,et al.  Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales , 2005, ACL.

[16]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Timothy Dozat,et al.  Incorporating Nesterov Momentum into Adam , 2016 .

[18]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[19]  Frank Hutter,et al.  Online Batch Selection for Faster Training of Neural Networks , 2015, ArXiv.

[20]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[21]  Li Fei-Fei,et al.  MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels , 2017, ICML.

[22]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[24]  Alexander J. Smola,et al.  Efficient mini-batch training for stochastic optimization , 2014, KDD.

[25]  Mark W. Schmidt,et al.  A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets , 2012, NIPS.

[26]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[27]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[28]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[29]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[30]  Tong Zhang,et al.  Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling , 2014, ArXiv.

[31]  H. Robbins A Stochastic Approximation Method , 1951 .

[32]  Sanjiv Kumar,et al.  On the Convergence of Adam and Beyond , 2018 .

[33]  Yurii Nesterov,et al.  Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[34]  J. Heinonen Lectures on Lipschitz analysis , 2005 .

[35]  Jonathan J. Hull,et al.  A Database for Handwritten Text Recognition Research , 1994, IEEE Trans. Pattern Anal. Mach. Intell..

[36]  David W. Jacobs,et al.  Big Batch SGD: Automated Inference using Adaptive Batch Sizes , 2016, ArXiv.

[37]  Chih-Jen Lin,et al.  A comparison of methods for multiclass support vector machines , 2002, IEEE Trans. Neural Networks.

[38]  Mark W. Schmidt,et al.  Erratum: Hybrid Deterministic-Stochastic Methods for Data Fitting , 2013, SIAM J. Sci. Comput..

[39]  Tao Qin,et al.  Learning What Data to Learn , 2017, ArXiv.