Free Supervision from Video Games

Deep networks are extremely hungry for data: they devour hundreds of thousands of labeled images to learn robust, semantically meaningful feature representations. Current networks are so data-hungry that collecting labeled data has become as important as designing the networks themselves. Unfortunately, manual data collection is both expensive and time-consuming. We present an alternative and show how ground-truth labels for many vision tasks can be extracted from video games in real time as we play them. We interface with the popular Microsoft® DirectX® rendering API and inject specialized rendering code into the game as it runs. This code produces ground-truth labels for instance segmentation, semantic labeling, depth estimation, optical flow, intrinsic image decomposition, and instance tracking. Instead of labeling images, a researcher now simply plays video games all day long. Our method is general and works on a wide range of video games. We collected a dataset of 220k training images and 60k test images across three video games, and evaluate state-of-the-art optical flow, depth estimation, and intrinsic image decomposition algorithms. Our video game data is visually closer to real-world images than that of other synthetic datasets.
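As an illustrative sketch (not the paper's actual injected rendering code): once a game's depth buffer and camera pose are intercepted, ground-truth optical flow for the static parts of a scene follows directly from geometry, by back-projecting each pixel with its depth, transforming it by the relative camera motion between frames, and re-projecting. The function name and the intrinsics/pose conventions below are assumptions chosen for the example.

```python
import numpy as np

def flow_from_depth(depth, K, R, t):
    """Ground-truth optical flow for a static scene, derived from a per-pixel
    depth map and the relative camera motion (R, t) from frame 1 to frame 2.

    depth: (H, W) metric depth per pixel (frame 1)
    K:     (3, 3) camera intrinsics (assumed shared by both frames)
    R, t:  rotation (3, 3) and translation (3,) of camera 2 w.r.t. camera 1
    Returns (H, W, 2) flow in pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Back-project each pixel to a 3D point in camera-1 coordinates.
    rays = pix @ np.linalg.inv(K).T
    pts = rays * depth[..., None]

    # Move the points into camera-2 coordinates and re-project.
    pts2 = pts @ R.T + t
    proj = pts2 @ K.T
    uv2 = proj[..., :2] / proj[..., 2:3]

    # Flow is the pixel displacement between the two projections.
    return uv2 - pix[..., :2]
```

For example, a pure sideways translation of 0.1 units with focal length 100 and uniform depth 2 yields a constant horizontal flow of fx * tx / z = 5 pixels. Dynamic objects require their own per-object motion, which is why the paper extracts labels from inside the rendering pipeline rather than from geometry alone.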