Driving in the Matrix: Can virtual worlds replace human-generated annotations for real world tasks?

Deep learning has rapidly transformed the state of the art algorithms used to address a variety of problems in computer vision and robotics. These breakthroughs have relied upon massive amounts of human annotated training data. This time consuming process has begun impeding the progress of these deep learning efforts. This paper describes a method to incorporate photo-realistic computer images from a simulation engine to rapidly generate annotated data that can be used for the training of machine learning algorithms. We demonstrate that a state of the art architecture, which is trained only using these synthetic annotations, performs better than the identical architecture trained on human annotated real-world data, when tested on the KITTI data set for vehicle detection. By training machine learning algorithms on a rich virtual world, real objects in real scenes can be learned and classified using synthetic data. This approach offers the possibility of accelerating deep learning's application to sensor-based classification problems like those that appear in self-driving cars. The source code and data to train and validate the networks described in this paper are made available for researchers.

[1]  Laura A. Dabbish,et al.  Labeling images with a computer game , 2004, AAAI Spring Symposium: Knowledge Collection from Volunteer Contributors.

[2]  Yann LeCun,et al.  Off-Road Obstacle Avoidance through End-to-End Learning , 2005, NIPS.

[3]  Manuel Blum,et al.  Peekaboom: a game for locating objects in images , 2006, CHI.

[4]  Antonio Torralba,et al.  LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.

[5]  Roberto Cipolla,et al.  Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[6]  David Vázquez,et al.  Learning appearance in virtual scenarios for pedestrian detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[8]  Alexei A. Efros,et al.  Undoing the Damage of Dataset Bias , 2012, ECCV.

[9]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Jiaolong Xu,et al.  Learning a Part-Based Pedestrian Detector in a Virtual World , 2014, IEEE Transactions on Intelligent Transportation Systems.

[11]  Takeo Kanade,et al.  Learning scene-specific pedestrian detectors without real data , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Zheng Zhang,et al.  MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.

[14]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[16]  Ross A. Knepper,et al.  DeepMPC: Learning Deep Latent Features for Model Predictive Control , 2015, Robotics: Science and Systems.

[17]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[18]  Vladlen Koltun,et al.  Playing for Data: Ground Truth from Computer Games , 2016, ECCV.

[19]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Luis Herranz,et al.  Scene Recognition with CNNs: Objects, Scales and Dataset Bias , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).