Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need

Large datasets have been crucial to the success of deep learning models in recent years, and these models keep improving as they are trained on more labelled data. While there have been sustained efforts to make models more data-efficient, the potential benefit of understanding the data itself is largely untapped. Focusing on object recognition, we ask whether, for common benchmark datasets, we can do better than random subsets of the data and find a subset on which training generalizes on par with the full dataset. To our knowledge, this is the first result identifying notable redundancies (at least 10%) in the CIFAR-10 and ImageNet datasets. Interestingly, we observe semantic correlations between required and redundant images. We hope that our findings motivate further research into identifying additional redundancies and exploiting them for more efficient training or data collection.
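The abstract does not spell out how the redundant subset is found, so the following is a minimal illustrative sketch, not the paper's method: it assumes redundancy is estimated by complete-linkage clustering of learned image embeddings and keeping one representative per cluster. The function name `select_representatives`, the `keep_fraction` parameter, and the choice of embedding source are all hypothetical.

```python
# Illustrative sketch only: assumes semantic redundancy can be approximated
# by clustering image embeddings (e.g. features from a pretrained classifier)
# and retaining one representative image per cluster.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def select_representatives(embeddings: np.ndarray, keep_fraction: float = 0.9) -> np.ndarray:
    """Return indices of a subset covering ~keep_fraction of the dataset."""
    n = embeddings.shape[0]
    n_clusters = int(n * keep_fraction)
    # Complete-linkage agglomerative clustering groups near-duplicate
    # (semantically redundant) examples into the same cluster.
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="complete"
    ).fit_predict(embeddings)
    keep = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Keep the member closest to the cluster mean as the representative.
        center = embeddings[members].mean(axis=0)
        dists = np.linalg.norm(embeddings[members] - center, axis=1)
        keep.append(members[np.argmin(dists)])
    return np.asarray(keep)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 64))           # stand-in for learned embeddings
    subset = select_representatives(X, 0.9)   # indices of the retained ~90%
    print(subset.shape)                       # (900,)
```

Any comparable selection rule would then be evaluated the same way the abstract describes: train on the retained subset and check that test accuracy matches training on the full dataset.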
