Learning deep visual object models from noisy web data: How to make it work

Deep networks thrive when trained on large-scale data collections. This has given ImageNet a central role in the development of deep architectures for visual object classification. However, ImageNet was created during a specific period in time, and as such it is prone to aging as well as to dataset bias. Moving beyond fixed training datasets will lead to more robust visual systems, especially for robots deployed in new environments, which must learn from the objects they encounter there. To make this possible, it is important to break free from the need for manual annotators. Recent work has begun to investigate how to use the massive amount of images available on the Web in place of manual image annotations. We contribute to this research thread with two findings: (1) a study relating a given level of label noise to the expected drop in accuracy, for two deep architectures and two different types of noise, which clearly identifies GoogLeNet as a suitable architecture for learning from Web data; (2) a recipe for creating Web datasets with minimal noise and maximum visual variability, based on a visual and natural language processing concept expansion strategy. By combining these two results, we obtain a method for learning powerful deep object models automatically from the Web. We confirm the effectiveness of our approach through object categorization experiments using our Web-derived version of ImageNet on a popular robot vision benchmark database, and on a lifelong object discovery task on a mobile robot.
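As a rough illustration of the kind of controlled experiment described in finding (1), the sketch below corrupts a chosen fraction of training labels before a model would be trained on them. It is only a minimal example under stated assumptions: the abstract does not name its two noise types, so uniform label flipping is used here as a stand-in, and the function names `inject_label_noise` and `observed_noise_rate` are hypothetical.

```python
import random


def inject_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` with a fixed fraction replaced by wrong classes.

    Assumption: noise is uniform over the other classes. This is an
    illustrative choice, not necessarily a noise type used in the study.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    # Pick distinct positions so exactly n_flip labels are corrupted.
    for i in rng.sample(range(len(noisy)), n_flip):
        # Draw from the num_classes - 1 classes that differ from the true one.
        wrong = rng.randrange(num_classes - 1)
        if wrong >= noisy[i]:
            wrong += 1
        noisy[i] = wrong
    return noisy


def observed_noise_rate(clean, noisy):
    """Fraction of positions where the noisy label differs from the clean one."""
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


if __name__ == "__main__":
    clean = [i % 10 for i in range(1000)]
    for rate in (0.0, 0.1, 0.3, 0.5):
        noisy = inject_label_noise(clean, num_classes=10, noise_rate=rate)
        print(rate, observed_noise_rate(clean, noisy))
```

Training the same architecture on each noisy copy and recording test accuracy would give one noise-versus-accuracy curve per architecture, which is the kind of comparison the study reports.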
