Automatic Data Augmentation from Massive Web Images for Deep Visual Recognition

Large-scale image datasets and deep convolutional neural networks (DCNNs) are the two primary driving forces for the rapid progress in generic object recognition tasks in recent years. While lots of network architectures have been continuously designed to pursue lower error rates, few efforts are devoted to enlarging existing datasets due to high labeling costs and unfair comparison issues. In this article, we aim to achieve lower error rates by augmenting existing datasets in an automatic manner. Our method leverages both the web and DCNN, where the web provides massive images with rich contextual information, and DCNN replaces humans to automatically label images under the guidance of web contextual information. Experiments show that our method can automatically scale up existing datasets significantly from billions of web pages with high accuracy. The performance on object recognition tasks and transfer learning tasks have been significantly improved by using the automatically augmented datasets, which demonstrates that more supervisory information has been automatically gathered from the web. Both the dataset and models trained on the dataset have been made publicly available.

[1]  Tao Mei,et al.  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[3]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[4]  Yong Rui,et al.  Image search—from thousands to billions in 20 years , 2013, TOMCCAP.

[5]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[6]  Jing Wang,et al.  Clickage: towards bridging semantic and intent gaps via mining click logs of search engines , 2013, ACM Multimedia.

[7]  Xiaogang Wang,et al.  Learning from massive noisy labeled data for image classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jinhui Tang,et al.  Weakly Supervised Deep Metric Learning for Community-Contributed Image Retrieval , 2015, IEEE Transactions on Multimedia.

[9]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[10]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[12]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[13]  Simone Paolo Ponzetto,et al.  BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network , 2012, Artif. Intell..

[14]  Jian Zhang,et al.  Exploiting Web Images for Dataset Construction: A Domain Robust Approach , 2016, IEEE Transactions on Multimedia.

[15]  G. Griffin,et al.  Caltech-256 Object Category Dataset , 2007 .

[16]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[17]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[18]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[19]  Fei-Fei Li,et al.  Novel Dataset for Fine-Grained Image Categorization : Stanford Dogs , 2012 .

[20]  Adrian Popescu,et al.  On Deep Representation Learning from Noisy Web Images , 2015, ArXiv.

[21]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[22]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[24]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[25]  Shuang Wang,et al.  INSTRE: A New Benchmark for Instance-Level Object Retrieval and Recognition , 2015, ACM Trans. Multim. Comput. Commun. Appl..

[26]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[27]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[28]  Hongxun Yao,et al.  Learning Cross Space Mapping via DNN Using Large Scale Click-Through Logs , 2015, IEEE Transactions on Multimedia.

[29]  Antonio Torralba,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition , 2022 .

[30]  Léon Bottou,et al.  Wasserstein GAN , 2017, ArXiv.

[31]  Barbara Caputo,et al.  Learning deep visual object models from noisy web data: How to make it work , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[32]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[34]  Wei-Ying Ma,et al.  Clustering and searching WWW images using link and page layout analysis , 2007, TOMCCAP.

[35]  Tomas Mikolov,et al.  Bag of Tricks for Efficient Text Classification , 2016, EACL.

[36]  Tiejun Zhao,et al.  Automatic Image Dataset Construction from Click-through Logs Using Deep Neural Network , 2015, ACM Multimedia.

[37]  Jonathan Krause,et al.  The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition , 2015, ECCV.

[38]  Jiebo Luo,et al.  Weakly Semi-Supervised Deep Learning for Multi-Label Image Annotation , 2015, IEEE Transactions on Big Data.

[39]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[40]  Subhransu Maji,et al.  Fine-Grained Visual Classification of Aircraft , 2013, ArXiv.

[41]  David A. Shamma,et al.  YFCC100M , 2015, Commun. ACM.

[42]  Wei Li,et al.  WebVision Database: Visual Learning and Understanding from Web Data , 2017, ArXiv.

[43]  Bernd Freisleben,et al.  Long-Term Incremental Web-Supervised Learning of Visual Concepts via Random Savannas , 2012, IEEE Transactions on Multimedia.

[44]  Ling Shao,et al.  Extracting Multiple Visual Senses for Web Learning , 2019, IEEE Transactions on Multimedia.

[45]  Dong Xu,et al.  Exploiting Privileged Information from Web Data for Image Categorization , 2014, ECCV.

[46]  Joan Bruna,et al.  Training Convolutional Networks with Noisy Labels , 2014, ICLR 2014.

[47]  Fei-Fei Li,et al.  Towards Scalable Dataset Construction: An Active Learning Approach , 2008, ECCV.