Regularization in neural network optimization via trimmed stochastic gradient descent with noisy labels

Regularization is essential for avoiding over-fitting to the training data in neural network optimization, leading to better generalization of the trained networks. Label noise provides strong implicit regularization by replacing the ground-truth labels of training examples with uniform random labels. However, it can also produce undesirable, misleading gradients due to the large losses associated with the incorrect labels. We propose a first-order optimization method, Label-Noised Trim-SGD, that combines label noise with example trimming to remove these outliers. The proposed algorithm makes it possible to impose a larger amount of label noise and obtain a stronger regularization effect than the original methods. We analyze the method quantitatively by comparing the behavior of label noise, example trimming, and the combined algorithm. We also present empirical results demonstrating the effectiveness of our algorithm on major benchmarks with fundamental network architectures, where our method outperforms state-of-the-art optimization methods.
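A minimal sketch of one training step under the description above: each label is replaced with a uniform random class with some probability, per-example losses are computed, and the examples with the largest losses are trimmed from the mini-batch before the gradient step. This is an illustrative PyTorch implementation, not the authors' reference code; the function name and the noise_rate and trim_ratio hyper-parameters are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def label_noised_trim_sgd_step(model, optimizer, x, y, num_classes,
                               noise_rate=0.5, trim_ratio=0.1):
    """One mini-batch step: inject uniform label noise, then trim the
    examples with the largest losses before back-propagating.

    noise_rate and trim_ratio are illustrative values, not ones
    taken from the paper.
    """
    # Replace each label with a uniform random class with prob. noise_rate.
    noisy_y = y.clone()
    flip = torch.rand_like(y, dtype=torch.float) < noise_rate
    noisy_y[flip] = torch.randint(0, num_classes, (int(flip.sum()),),
                                  device=y.device)

    # Per-example losses (reduction='none' keeps one loss per example).
    logits = model(x)
    losses = F.cross_entropy(logits, noisy_y, reduction='none')

    # Trim: drop the examples with the largest losses, i.e. the likely
    # outliers created by the injected noise.
    batch_size = losses.size(0)
    keep = batch_size - int(trim_ratio * batch_size)
    kept_losses, _ = torch.topk(losses, keep, largest=False)

    optimizer.zero_grad()
    kept_losses.mean().backward()
    optimizer.step()
    return kept_losses.mean().item()
```

Because the trimming is a single top-k selection over per-example losses, its cost is negligible next to the forward and backward passes, which is what lets a large noise rate be used without the corrupted examples dominating the gradient.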
