Stochastic Batch Augmentation with An Effective Distilled Dynamic Soft Label Regularizer

Data augmentation has been widely used in training deep neural networks to improve generalization, whether in the original space (e.g., image space) or in representation space. Despite this success, the connection between the synthesized data and the original data is largely ignored in training: the fact that synthesized samples are distributed in the vicinity of the original sample is not exploited, so the network's behavior on this vicinity distribution is not explicitly optimized. Yet that behavior is crucial for generalization and, in the adversarial setting, for the safety of deep learning systems. In this work, we propose a framework called Stochastic Batch Augmentation (SBA) to address these problems. SBA stochastically decides whether to augment at each iteration, controlled by a batch scheduler, and introduces a "distilled" dynamic soft-label regularization that incorporates the similarity of samples in the vicinity distribution with respect to the raw samples. The proposed regularization provides direct supervision via the KL-divergence between the output softmax distributions of the original and virtual data. Our experiments on CIFAR-10, CIFAR-100, and ImageNet show that SBA can improve the generalization of neural networks and speed up the convergence of network training.
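A minimal sketch of the kind of soft-label regularizer the abstract describes, assuming PyTorch: the softmax output on the raw batch serves as a soft target, and the KL-divergence to the output on the augmented ("virtual") batch is added to the loss only on iterations selected by the batch scheduler. All names here (`sba_regularizer`, `train_step`, `augment`, `lambda_reg`) are illustrative, not the authors' API, and detaching the original branch is an assumption.

```python
import torch
import torch.nn.functional as F

def sba_regularizer(model, x, x_virtual, temperature=1.0):
    """KL(p_original || p_virtual), averaged over the batch (illustrative)."""
    with torch.no_grad():  # assumption: raw-sample softmax is treated as a fixed soft label
        p_orig = F.softmax(model(x) / temperature, dim=1)
    log_p_virt = F.log_softmax(model(x_virtual) / temperature, dim=1)
    return F.kl_div(log_p_virt, p_orig, reduction="batchmean")

def train_step(model, optimizer, x, y, augment, do_augment, lambda_reg=0.5):
    """One training iteration; augmentation is applied only when the scheduler says so."""
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    if do_augment:                 # stochastic decision from the batch scheduler
        x_virtual = augment(x)     # e.g., a perturbation in input or representation space
        loss = loss + lambda_reg * sba_regularizer(model, x, x_virtual)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```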
