Per-Example Gradient Regularization Improves Learning Signals from Noisy Data

Gradient regularization, as described in \citet{barrett2021implicit}, is an effective technique for promoting flat minima during gradient descent. Empirical evidence suggests that it can significantly improve the robustness of deep learning models against noise perturbations while also reducing test error. In this paper, we study per-example gradient regularization (PEGR) and present a theoretical analysis demonstrating its effectiveness in improving both test error and robustness to noise perturbations. Specifically, we adopt the signal-noise data model of \citet{cao2022benign} and show that PEGR learns the signal effectively while suppressing noise. In contrast, standard gradient descent struggles to distinguish the signal from the noise, leading to suboptimal generalization. Our analysis reveals that PEGR penalizes the variance of pattern learning, thereby suppressing the memorization of noise from the training data. These findings underscore the importance of variance control in deep learning training and offer useful insights for developing more effective training approaches.
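To make the idea concrete, here is a minimal sketch of per-example gradient regularization for squared loss on a linear model, where each example's gradient has a closed form and no autodiff is needed. The objective penalizes the mean squared norm of the per-example gradients on top of the usual loss. All names, hyperparameters, and the linear setting are illustrative assumptions, not the paper's actual experimental setup.

```python
import numpy as np

def pegr_step(w, X, y, lr=0.05, lam=0.1):
    """One gradient step on a PEGR-style objective (illustrative sketch).

    Objective: mean_i l_i(w) + (lam/2) * mean_i ||grad_w l_i(w)||^2,
    with l_i(w) = 0.5 * (w @ x_i - y_i)^2. Then grad_w l_i = r_i * x_i
    and ||grad_w l_i||^2 = r_i^2 * ||x_i||^2, so both terms are closed-form.
    """
    n = len(y)
    r = X @ w - y                          # per-example residuals r_i
    grad_loss = X.T @ r / n                # gradient of the mean loss
    # d/dw [ (lam/2) * mean_i r_i^2 ||x_i||^2 ] = lam * mean_i r_i ||x_i||^2 x_i
    sq_norms = np.sum(X ** 2, axis=1)      # ||x_i||^2 for each example
    grad_pen = lam * (X.T @ (r * sq_norms)) / n
    return w - lr * (grad_loss + grad_pen)

# Toy data: noisy observations of a known linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.ones(5)
y = X @ w_true + 0.1 * rng.normal(size=50)

w = np.zeros(5)
for _ in range(500):
    w = pegr_step(w, X, y)
```

Note that the per-example penalty weights each residual by its input's squared norm, so examples with large gradients (here, large `||x_i||^2`) are damped more aggressively; in a deep network the per-example gradient norms would instead be computed via autodiff (e.g. per-sample gradients), which is the costly part of PEGR in practice.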

[1] Quanquan Gu et al. Benign Overfitting for Two-layer ReLU Networks. ArXiv, 2023.

[2] Zhiyuan Li et al. How Does Sharpness-Aware Minimization Minimize Sharpness? ArXiv, 2022.

[3] D. Barrett et al. Why neural networks find simple solutions: the many regularizers of geometric complexity. NeurIPS, 2022.

[4] Nicolas Flammarion et al. Towards Understanding Sharpness-Aware Minimization. ICML, 2022.

[5] Mikhail Belkin et al. Benign Overfitting in Two-layer Convolutional Neural Networks. NeurIPS, 2022.

[6] Niladri S. Chatterji et al. Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data. COLT, 2022.

[7] Yang Zhao et al. Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning. ICML, 2022.

[8] Jianfeng Yao et al. Impact of classification difficulty on the weight matrices spectra in Deep Learning and application to early-stopping. ArXiv:2111.13331, 2021.

[9] Mikhail Belkin et al. Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures. NeurIPS, 2021.

[10] Ariel Kleiner et al. Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR, 2020.

[11] D. Barrett et al. Implicit Gradient Regularization. ICLR, 2020.

[12] Philip M. Long et al. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. J. Mach. Learn. Res., 2020.

[13] Andrea Montanari et al. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve. Communications on Pure and Applied Mathematics, 2019.

[14] Michael W. Mahoney et al. Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning. J. Mach. Learn. Res., 2018.

[15] Yuanzhi Li et al. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. ICLR, 2020.

[16] Philip M. Long et al. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2019.

[17] Guy Blanc et al. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. COLT, 2019.

[18] Ekaba Bisong et al. Regularization for Deep Learning. Building Machine Learning and Deep Learning Models on Google Cloud Platform, 2019.

[19] Yoshua Bengio et al. Three Factors Influencing Minima in SGD. ArXiv, 2017.