When and Why Test-Time Augmentation Works

Test-time augmentation (TTA)---the aggregation of predictions across transformed versions of a test input---is a common practice in image classification. In this paper, we present theoretical and experimental analyses that shed light on 1) when test time augmentation is likely to be helpful and 2) when to use various test-time augmentation policies. A key finding is that even when TTA produces a net improvement in accuracy, it can change many correct predictions into incorrect predictions. We delve into when and why test-time augmentation changes a prediction from being correct to incorrect and vice versa. Our analysis suggests that the nature and amount of training data, the model architecture, and the augmentation policy all matter. Building on these insights, we present a learning-based method for aggregating test-time augmentations. Experiments across a diverse set of models, datasets, and augmentations show that our method delivers consistent improvements over existing approaches.

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Kensuke Yokoi,et al.  APAC: Augmented PAttern Classification with Neural Networks , 2015, ArXiv.

[3]  Hiroshi Koga,et al.  Image Classification of Melanoma, Nevus and Seborrheic Keratosis by Deep Neural Network Ensemble , 2017, ArXiv.

[4]  Lanfen Lin,et al.  A deep 3D residual CNN for false-positive reduction in pulmonary nodule detection. , 2018, Medical physics.

[5]  James A. Storer,et al.  Deflecting Adversarial Attacks with Pixel Deflection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[8]  Xiaohua Zhai,et al.  Are we done with ImageNet? , 2020, ArXiv.

[9]  Dmitry Vetrov,et al.  Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation , 2020, UAI.

[10]  Seyed-Mohsen Moosavi-Dezfooli,et al.  DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andrew G. Howard,et al.  Some Improvements on Deep Convolutional Neural Network Based Image Classification , 2013, ICLR.

[12]  Quoc V. Le,et al.  AutoAugment: Learning Augmentation Strategies From Data , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Philipp Berens,et al.  Test-time Data Augmentation for Estimation of Heteroscedastic Aleatoric Uncertainty in Deep Neural Networks , 2018 .

[14]  Yang Song,et al.  PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples , 2017, ICLR.

[15]  Alexei A. Efros,et al.  What Should Not Be Contrastive in Contrastive Learning , 2020, ICLR.

[16]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[17]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[18]  J. Zico Kolter,et al.  Certified Adversarial Robustness via Randomized Smoothing , 2019, ICML.

[19]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Yarin Gal,et al.  Understanding Measures of Uncertainty for Adversarial Example Detection , 2018, UAI.

[21]  Kouichi Sakurai,et al.  One Pixel Attack for Fooling Deep Neural Networks , 2017, IEEE Transactions on Evolutionary Computation.

[22]  Younghoon Kim,et al.  Learning Loss for Test-Time Augmentation , 2020, NeurIPS.

[23]  Sébastien Ourselin,et al.  Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks , 2018, Neurocomputing.