Model-Driven Simulations for Computer Vision

There is growing interest in using Computer Graphics (CG) renderings to generate large-scale annotated data for training machine learning systems, such as deep convolutional neural networks, for Computer Vision (CV). However, the usefulness of CG-generated data for tuning CV systems has been debated since the 1980s. In particular, it is still unclear how modeling errors and computational rendering approximations, which arise from choices in the rendering pipeline, affect the generalization performance of trained CV systems. In this paper, we use a case study of traffic scenarios to empirically analyze the performance degradation that occurs when CV systems trained on virtual data are transferred to real data. We (a) discuss a generative model coupled with 3D CAD shapes for scene instance synthesis and (b) explore system performance tradeoffs due to the choice of rendering engine (e.g., Lambertian shader (LS), ray tracing (RT), and Monte Carlo path tracing (MCPT)) and their respective parameters. DeepLab, which performs semantic segmentation, is the CV system being evaluated. In our case study, when the CV system is trained with CG data samples rendered with MCPT or RT and augmented with only 10% of the real-world training data from the Cityscapes dataset, it achieves performance comparable to that of training DeepLab with the complete Cityscapes dataset. Using samples from LS degraded the performance of DeepLab by 20%. Physics-based MCPT rendering improved the performance by 6%, but at the cost of more than 3 times the rendering time.
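
The data mix described above (all rendered samples plus a small fraction of the real training set) can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the function name `mix_training_set`, the `seed` parameter, and the placeholder file names are assumptions, and the abstract does not say how the 10% real subset was selected, so uniform random sampling is assumed here.

```python
import random


def mix_training_set(synthetic, real, real_fraction=0.10, seed=0):
    """Return every synthetic sample plus a random fraction of the real ones.

    Hypothetical helper: uniform random selection of the real subset is an
    assumption, not something stated in the abstract.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_real = round(len(real) * real_fraction)
    real_subset = rng.sample(list(real), n_real)  # sampling without replacement
    mixed = list(synthetic) + real_subset
    rng.shuffle(mixed)  # interleave synthetic and real samples
    return mixed


# Placeholder sample identifiers (hypothetical, not actual dataset files).
synthetic_samples = [f"cg_{i:04d}" for i in range(50)]
real_samples = [f"real_{i:04d}" for i in range(200)]

training_set = mix_training_set(synthetic_samples, real_samples)
print(len(training_set))
```

Keeping the real fraction as a parameter makes it easy to repeat an experiment at other mixing ratios while holding the synthetic set fixed.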
