Paper information - Future Urban Scenes Generation Through Vehicles Synthesis

In this work we propose a deep learning pipeline to predict the future visual appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, we follow a two-stage approach in which interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, i.e. generating a synthetic representation of an object undergoing a geometrical roto-translation in 3D space. Our model can be easily conditioned on constraints (e.g. input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse, realistic futures from the same input in a multi-modal fashion. We show, visually and quantitatively, the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset.
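To make the notion of a roto-translation concrete, the following is a minimal sketch of a rigid motion applied to a handful of 3D points. It is not the authors' pipeline: the function name `roto_translate`, the choice of the y axis as the vertical axis, and the sample keypoints are all assumptions made for illustration.

```python
import math

def roto_translate(points, yaw_rad, translation):
    """Rotate 3D points about the vertical (y) axis, then translate them.

    The axis convention (y up) is an assumption for this sketch.
    """
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    moved = []
    for x, y, z in points:
        # Rotation about y: x' = c*x + s*z, z' = -s*x + c*z, y unchanged.
        moved.append((c * x + s * z + tx, y + ty, -s * x + c * z + tz))
    return moved

# Hypothetical keypoints of a vehicle in its own frame (metres).
keypoints = [(1.0, 0.0, 0.0), (0.0, 0.0, 2.0)]

# A quarter turn followed by a 5 m shift along z.
future_keypoints = roto_translate(keypoints, math.pi / 2, (0.0, 0.0, 5.0))
```

Under these assumptions, a sequence of such transforms along an input trajectory would give the object poses that a per-object view synthesis model is conditioned on.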
