PennSyn2Real: Training Object Recognition Models without Human Labeling

Scalable training data generation is a critical problem in deep learning. We propose PennSyn2Real - a photo-realistic synthetic dataset consisting of more than 100,000 4K images of more than 20 types of micro aerial vehicles (MAVs). The dataset can be used to generate arbitrary numbers of training images for high-level computer vision tasks such as MAV detection and classification. Our data generation framework bootstraps chroma-keying, a mature cinematography technique with a motion tracking system, providing artifact-free and curated annotated images where object orientations and lighting are controlled. This framework is easy to set up and can be applied to a broad range of objects, reducing the gap between synthetic and real-world data. We show that synthetic data generated using this framework can be directly used to train CNN models for common object recognition tasks such as detection and segmentation. We demonstrate competitive performance in comparison with training using only real images. Furthermore, bootstrapping the generated synthetic data in few-shot learning can significantly improve the overall performance, reducing the number of required training data samples to achieve the desired accuracy.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Takeo Kanade,et al.  How Useful Is Photo-Realistic Rendering for Visual Learning? , 2016, ECCV Workshops.

[3]  Naila Murray,et al.  Virtual KITTI 2 , 2020, ArXiv.

[4]  Masatoshi Okutomi,et al.  Seamless image cloning by a closed form solution of a modified Poisson problem , 2012, SA '12.

[5]  Vijay Kumar,et al.  Multi-robot Trajectory Generation for an Aerial Payload Transport System , 2017, ISRR.

[6]  Xu Liu,et al.  SLOAM: Semantic Lidar Odometry and Mapping for Forest Inventory , 2019, IEEE Robotics and Automation Letters.

[7]  Ali Farhadi,et al.  YOLOv3: An Incremental Improvement , 2018, ArXiv.

[8]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Eugenio Culurciello,et al.  ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation , 2016, ArXiv.

[11]  Antonio Manuel López Peña,et al.  Procedural Generation of Videos to Train Deep Action Recognition Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Vladlen Koltun,et al.  Playing for Benchmarks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Jing Li,et al.  Multi-target detection and tracking from a single camera in Unmanned Aerial Vehicles (UAVs) , 2016, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[14]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[15]  Jana Kosecka,et al.  Synthesizing Training Data for Object Detection in Indoor Scenes , 2017, Robotics: Science and Systems.

[16]  Giuseppe Loianno,et al.  MAVNet: An Effective Semantic Segmentation Micro-Network for MAV-Based Tasks , 2019, IEEE Robotics and Automation Letters.

[17]  Yu Cheng,et al.  Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond , 2018, ArXiv.

[18]  Michael J. Black,et al.  A Naturalistic Open Source Movie for Optical Flow Evaluation , 2012, ECCV.

[19]  Hod Lipson,et al.  Autonomous drone hunter operating by deep learning and all-onboard computations in GPS-denied environments , 2019, PloS one.

[20]  Sergey I. Nikolenko Synthetic Data for Deep Learning , 2019, ArXiv.

[21]  Stefan Leutenegger,et al.  SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Xi Wang,et al.  High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth , 2014, GCPR.

[23]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[24]  Markus Ulrich,et al.  MVTec D2S: Densely Segmented Supermarket Dataset , 2018, ECCV.

[25]  Qiao Wang,et al.  VirtualWorlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Anthony Cowley,et al.  Mine Tunnel Exploration Using Multiple Quadrupedal Robots , 2020, IEEE Robotics and Automation Letters.

[27]  Vijay Kumar,et al.  ModQuad: The Flying Modular Structure that Self-Assembles in Midair , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[28]  Matthias Bethge,et al.  ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness , 2018, ICLR.

[29]  Marc Pollefeys,et al.  Slanted Stixels: Representing San Francisco's Steepest Streets , 2017, BMVC.

[30]  Nicolas Riviere,et al.  Deep learning-based strategies for the detection and tracking of drones using several cameras , 2019, IPSJ Trans. Comput. Vis. Appl..

[31]  Eduardo Romera,et al.  ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation , 2018, IEEE Transactions on Intelligent Transportation Systems.

[32]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[33]  Lei Yang,et al.  RPC: A Large-Scale Retail Product Checkout Dataset , 2019, ArXiv.

[34]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Antonio Torralba,et al.  Evaluation of image features using a photorealistic virtual world , 2011, 2011 International Conference on Computer Vision.

[36]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[37]  Mehmet Çağrı Aksoy,et al.  Drone Dataset: Amateur Unmanned Air Vehicle Detection , 2019 .

[38]  Robert Laganière,et al.  How much real data do we actually need: Analyzing object detection performance using synthetic and real data , 2019, ArXiv.

[39]  Richard Szeliski,et al.  A Database and Evaluation Methodology for Optical Flow , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[40]  Lars Petersson,et al.  Effective Use of Synthetic Data for Urban Scene Semantic Segmentation , 2018, ECCV.