Predicting Ground-Level Scene Layout from Aerial Imagery

We introduce a novel strategy for learning to extract semantically meaningful features from aerial imagery. Instead of manually labeling the aerial imagery, we propose to predict (noisy) semantic features automatically extracted from co-located ground imagery. Our network architecture takes an aerial image as input, extracts features using a convolutional neural network, and then applies an adaptive transformation to map these features into the ground-level perspective. We use an end-to-end learning approach to minimize the difference between the semantic segmentation extracted directly from the ground image and the semantic segmentation predicted solely based on the aerial image. We show that a model learned using this strategy, with no additional training, is already capable of rough semantic labeling of aerial imagery. Furthermore, we demonstrate that by finetuning this model we can achieve more accurate semantic segmentation than two baseline initialization strategies. We use our network to address the task of estimating the geolocation and geo-orientation of a ground image. Finally, we show how features extracted from an aerial image can be used to hallucinate a plausible ground-level panorama.

[1]  Serge J. Belongie,et al.  Learning deep representations for ground-to-aerial geolocalization , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[3]  Scott Workman,et al.  Wide-Area Image Geolocalization with Aerial Reference Imagery , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[5]  Jamie Sherrah,et al.  Effective semantic pixel labelling with convolutional networks and Conditional Random Fields , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[6]  Dong Liu,et al.  Robust visual domain adaptation with low-rank reconstruction , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[8]  Pietro Perona,et al.  Cataloging Public Objects Using Aerial and Street-Level Images — Urban Trees , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  James Hays,et al.  Localizing and Orienting Street Views Using Overhead Imagery , 2016, ECCV.

[10]  Rama Chellappa,et al.  Visual Domain Adaptation: A survey of recent advances , 2015, IEEE Signal Processing Magazine.

[11]  Thomas Mauthner,et al.  Semantic Classification in Aerial Imagery by Integrating Appearance and Height Information , 2009, ACCV.

[12]  Xinlei Chen,et al.  PixelNet: Towards a General Pixel-level Architecture , 2016, ArXiv.

[13]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ramakant Nevatia,et al.  Detecting buildings in aerial images , 1988, Comput. Vis. Graph. Image Process..

[15]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[16]  Sébastien Lefèvre,et al.  Coupling ground-level panoramas and aerial imagery for change detection , 2016, Geo spatial Inf. Sci..

[17]  Franz Rottensteiner,et al.  ISPRS Test Project on Urban Classification and 3D Building Reconstruction: Evaluation of Building Reconstruction Results , 2009 .

[18]  Geoffrey E. Hinton,et al.  Learning to Detect Roads in High-Resolution Aerial Images , 2010, ECCV.

[19]  Junwei Han,et al.  A Survey on Object Detection in Optical Remote Sensing Images , 2016, ArXiv.

[20]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[21]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[22]  Serge J. Belongie,et al.  Cross-View Image Geolocalization , 2013, CVPR.

[23]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[24]  Sanja Fidler,et al.  HD Maps: Fine-Grained Road Segmentation by Parsing Ground and Aerial Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Steven M. Seitz,et al.  Filter flow , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[26]  Daniel Marcu,et al.  Domain Adaptation for Statistical Classifiers , 2006, J. Artif. Intell. Res..

[27]  Scott Workman,et al.  On the location dependence of convolutional neural network features , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[28]  Yoshua Bengio,et al.  Deep Directed Generative Models with Energy-Based Probability Estimation , 2016, ArXiv.

[29]  Geoffrey E. Hinton,et al.  Learning to Label Aerial Images from Noisy Data , 2012, ICML.

[30]  Jitendra Malik,et al.  View Synthesis by Appearance Flow , 2016, ECCV.

[31]  Frank Ade,et al.  A rule-based system for house reconstruction from aerial images , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[32]  Jiebo Luo,et al.  Event recognition: viewing the world with a third eye , 2008, ACM Multimedia.

[33]  Huanxin Zou,et al.  Unsupervised Cross-View Semantic Transfer for Remote Sensing Image Classification , 2016, IEEE Geoscience and Remote Sensing Letters.

[34]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.