Ada-Boundary: accelerating DNN training via adaptive boundary batch selection

Neural networks converge faster when guided by an effective batch selection strategy. To this end, we propose Ada-Boundary, a simple and novel adaptive batch selection algorithm that constructs an effective mini-batch according to the learning progress of the model. Our key idea is to exploit confusing samples, i.e., those whose labels the model cannot predict with high confidence; such samples lie near the current decision boundary and are therefore the most effective for expediting convergence. Owing to this design, Ada-Boundary maintains its advantage across varying degrees of training difficulty. We demonstrate the benefit of Ada-Boundary through extensive experiments with CNNs on five benchmark data sets: for a fixed wall-clock training time, Ada-Boundary reduces the test error by up to 31.80% relative to the baseline, thereby converging faster.
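To make the selection criterion concrete, the sketch below illustrates one plausible way to pick boundary-near samples: score each sample by the gap between the softmax probability of its true label and that of the strongest competing label, then draw a mini-batch with rank-based probabilities that favour small gaps. This is a minimal illustration, not the paper's exact algorithm; the gap-based score, the rank-based weighting with a temperature knob, and the names boundary_scores and sample_boundary_batch are all assumptions introduced here.

```python
import numpy as np

def boundary_scores(probs, labels):
    """Score each sample by how close it appears to be to the decision boundary.

    A sample is treated as "confusing" when the probability of its true label
    is close to the probability of the best competing label; the smaller the
    gap, the nearer the sample is assumed to be to the boundary.
    (Hypothetical scoring; the paper's exact distance measure may differ.)
    """
    n = len(labels)
    true_p = probs[np.arange(n), labels]
    rivals = probs.copy()
    rivals[np.arange(n), labels] = -np.inf          # mask out the true label
    rival_p = rivals.max(axis=1)
    return np.abs(true_p - rival_p)                 # small gap -> near boundary

def sample_boundary_batch(probs, labels, batch_size, temperature=0.2, rng=None):
    """Draw mini-batch indices, favouring samples near the boundary.

    Sampling is probabilistic rather than greedy so that easy and very hard
    samples still appear occasionally instead of a fixed hard subset.
    """
    rng = rng or np.random.default_rng()
    gaps = boundary_scores(probs, labels)
    ranks = np.argsort(np.argsort(gaps))            # 0 = closest to boundary
    weights = np.exp(-ranks / (temperature * len(gaps)))
    weights /= weights.sum()
    return rng.choice(len(gaps), size=batch_size, replace=False, p=weights)

# Toy usage: 1000 samples, 10 classes, random stand-in for model probabilities.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)
labels = rng.integers(0, 10, size=1000)
batch_idx = sample_boundary_batch(probs, labels, batch_size=64, rng=rng)
print(batch_idx[:10])
```

In practice the probabilities would come from the model's forward pass at the current epoch, so the scores, and hence the composition of each mini-batch, adapt as training progresses.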
