PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision

In recent years, person detection and human pose estimation have made great strides, helped by large-scale labeled datasets. However, these datasets had no guarantees or analysis of human activities, poses, or context diversity. Additionally, privacy, legal, safety, and ethical concerns may limit the ability to collect more human data. An emerging alternative to real-world data that alleviates some of these issues is synthetic data. However, creation of synthetic data generators is incredibly challenging and prevents researchers from exploring their usefulness. Therefore, we release a human-centric synthetic data generator PEOPLESANSPEOPLE which contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels. Using PEOPLESANSPEOPLE, we performed benchmark synthetic data training using a Detectron2 Keypoint R-CNN variant [1]. We found that pre-training a network using synthetic data and fine-tuning on target real-world data (few-shot transfer to limited subsets of COCO-person train [2]) resulted in a keypoint AP of 60.37 ± 0.48 (COCO test-dev2017) outperforming models trained with the same real data alone (keypoint AP of 55.80) and pre-trained with ImageNet (keypoint AP of 57.50). This freely-available data generator1 should enable a wide range of research into the emerging field of simulation to real transfer learning in the critical area of human-centric computer vision.

[1]  Wojciech Zaremba,et al.  Domain randomization for transferring deep neural networks from simulation to the real world , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[2]  Germán Ros,et al.  CARLA: An Open Urban Driving Simulator , 2017, CoRL.

[3]  Jitendra Malik,et al.  Habitat: A Platform for Embodied AI Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Magnus Wrenninge,et al.  Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing , 2018, ArXiv.

[5]  Silvio Savarese,et al.  iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes , 2020, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Christoph Heindl,et al.  BlendTorch: A Real-Time, Adaptive Domain Randomization Library , 2020, ICPR Workshops.

[8]  Mike Roberts,et al.  Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Antonio Manuel López Peña,et al.  Procedural Generation of Videos to Train Deep Action Recognition Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Ali Farhadi,et al.  AI2-THOR: An Interactive 3D Environment for Visual AI , 2017, ArXiv.

[11]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Joachim Tesch,et al.  Populating 3D Scenes by Learning Human-Scene Interaction , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Dieter Fox,et al.  Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects , 2018, CoRL.

[15]  Sujoy Ganguly,et al.  Unity Perception: Generate Synthetic Data for Computer Vision , 2021, ArXiv.

[16]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[17]  Robert Bridson,et al.  Fast Poisson disk sampling in arbitrary dimensions , 2007, SIGGRAPH '07.

[18]  Andrea Palazzi,et al.  Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World , 2018, ECCV.

[19]  Stanley T. Birchfield,et al.  Structured Domain Randomization: Bridging the Reality Gap by Context-Aware Synthetic Data , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[20]  Jitendra Malik,et al.  Habitat 2.0: Training Home Assistants to Rearrange their Habitat , 2021, ArXiv.

[21]  Stephen Tyree,et al.  NViSII: A Scriptable Tool for Photorealistic Image Generation , 2021, ArXiv.

[22]  Gerard Medioni,et al.  From Real to Synthetic and Back: Synthesizing Training Data for Multi-Person Scene Understanding , 2020, ArXiv.

[23]  Sanja Fidler,et al.  Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation , 2020, ECCV.

[24]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Qiao Wang,et al.  VirtualWorlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[27]  Kalyan Sunkavalli,et al.  OpenRooms: An Open Framework for Photorealistic Indoor Scene Datasets , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Bernt Schiele,et al.  Learning people detection models from few training samples , 2011, CVPR 2011.

[29]  Jianxiong Xiao,et al.  SUN RGB-D: A RGB-D scene understanding benchmark suite , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[31]  Kaiming He,et al.  Rethinking ImageNet Pre-Training , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Varun Jampani,et al.  Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[33]  Preston Bukaty The California Consumer Privacy Act (CCPA) , 2019 .

[34]  Slawomir Bak,et al.  Domain Adaptation through Synthesis for Unsupervised Person Re-identification , 2018, ECCV.

[35]  Hao Zhu,et al.  CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Alexander G. Schwing,et al.  SAIL-VOS: Semantic Amodal Instance Level Video Object Segmentation – A Synthetic Dataset and Baselines , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Vladlen Koltun,et al.  Playing for Benchmarks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[39]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Marc T. Law,et al.  Sim2SG: Sim-to-Real Scene Graph Generation for Transfer Learning , 2020, ArXiv.

[42]  Stefan Hinterstoißer,et al.  An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[43]  Alexander G. Schwing,et al.  SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction from Video Data , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[44]  Paul Voigt,et al.  The Eu General Data Protection Regulation (Gdpr): A Practical Guide , 2017 .

[45]  Sanja Fidler,et al.  Meta-Sim: Learning to Generate Synthetic Datasets , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).