Carpe Diem, Seize the Samples Uncertain "at the Moment" for Adaptive Batch Selection

The accuracy of deep neural networks is significantly affected by how well mini-batches are constructed during training. In this paper, we propose a novel adaptive batch selection algorithm called Recency Bias that exploits uncertain samples, i.e., those predicted inconsistently in recent iterations. The historical label predictions of each training sample are used to evaluate its predictive uncertainty within a sliding window. Then, the sampling probability for the next mini-batch is assigned to each training sample in proportion to its predictive uncertainty. With this design, Recency Bias not only accelerates training but also yields a more accurate network. We demonstrate the superiority of Recency Bias through extensive evaluation on two independent tasks. Compared with existing batch selection methods, Recency Bias reduced the test error by up to 20.97% within a fixed wall-clock training time and, at the same time, reduced the training time needed to reach the same test error by up to 59.32%.
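As a rough illustration of the mechanism described above, the sketch below tracks each training sample's recently predicted labels in a fixed-length sliding window, scores its uncertainty as the normalized entropy of those recent predictions, and draws the next mini-batch with probability proportional to that score. This is a minimal sketch of the idea, not the paper's exact formulation; names such as RecencyBiasSampler, window_size, and eps, as well as the entropy-based score itself, are assumptions made for illustration.

```python
import numpy as np
from collections import deque


class RecencyBiasSampler:
    """Illustrative sketch: uncertainty-proportional mini-batch sampling
    based on a sliding window of recent label predictions."""

    def __init__(self, num_samples, num_classes, window_size=10, eps=1e-3):
        self.num_classes = num_classes
        self.eps = eps  # floor so every sample keeps a nonzero sampling chance
        # One fixed-length history of predicted labels per training sample.
        self.history = [deque(maxlen=window_size) for _ in range(num_samples)]

    def record(self, sample_ids, predicted_labels):
        """Store the labels predicted for a mini-batch at the current iteration."""
        for i, y_hat in zip(sample_ids, predicted_labels):
            self.history[i].append(int(y_hat))

    def _uncertainty(self, i):
        """Normalized entropy of the empirical label distribution in the window."""
        hist = self.history[i]
        if not hist:
            return 1.0  # samples not yet seen are treated as maximally uncertain
        counts = np.bincount(list(hist), minlength=self.num_classes)
        p = counts / counts.sum()
        p = p[p > 0]
        entropy = -(p * np.log(p)).sum()
        return entropy / np.log(self.num_classes)  # scale to [0, 1]

    def next_batch(self, batch_size):
        """Draw sample indices with probability proportional to uncertainty."""
        scores = np.array([self._uncertainty(i) for i in range(len(self.history))])
        probs = (scores + self.eps) / (scores + self.eps).sum()
        return np.random.choice(len(self.history), size=batch_size,
                                replace=False, p=probs)
```

In a training loop, one would call record(...) with the model's predicted labels after each forward pass and next_batch(...) to choose the indices for the following iteration; the eps floor keeps every sample selectable even when its recent predictions are perfectly consistent. How the sampling distribution is smoothed or annealed over epochs is left to the paper.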
