On Feature Normalization and Data Augmentation

Modern neural network training relies heavily on data augmentation for improved generalization. After the initial success of label-preserving augmentations, there has been a recent surge of interest in label-perturbing approaches, which combine features and labels across training samples to smooth the learned decision surface. In this paper, we propose a new augmentation method that leverages the first and second moments extracted and re-injected by feature normalization. We replace the moments of the learned features of one training image with those of another, and also interpolate the target labels. Because our approach is fast, operates entirely in feature space, and mixes signals different from those used by prior methods, it can be combined effectively with existing augmentation methods. We demonstrate its efficacy across benchmark datasets in computer vision, speech, and natural language processing, where it consistently improves the generalization performance of highly competitive baseline networks.
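
To make the mechanism concrete, below is a minimal PyTorch sketch of the moment-exchange idea described in the abstract. It is an illustration under stated assumptions rather than the paper's reference implementation: the function name `moment_exchange`, the choice of positional (across-channel) moments, and the epsilon constant are our own.

```python
import torch

def moment_exchange(features: torch.Tensor, labels: torch.Tensor, eps: float = 1e-5):
    """Exchange per-position feature moments between random pairs in a batch.

    features: (B, C, H, W) activations from an intermediate layer.
    labels:   (B,) integer class targets.
    Returns the mixed features plus both label sets for interpolation.
    """
    B = features.size(0)
    perm = torch.randperm(B, device=features.device)

    # First and second moments across the channel dimension
    # (positional-normalization style); shapes are (B, 1, H, W).
    mean = features.mean(dim=1, keepdim=True)
    std = features.var(dim=1, keepdim=True, unbiased=False).add(eps).sqrt()

    # Normalize each sample, then re-inject the moments of a shuffled partner.
    normalized = (features - mean) / std
    mixed = normalized * std[perm] + mean[perm]
    return mixed, labels, labels[perm]
```

In a training step, the two returned label sets would then be interpolated in the loss, e.g. `loss = lam * criterion(out, y_a) + (1 - lam) * criterion(out, y_b)` for some mixing weight `lam`, mirroring how other label-perturbing methods combine targets.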
