Learning to Map the Visual and Auditory World

ABSTRACT OF DISSERTATION

The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Billions of images that capture this complex relationship are uploaded to social-media websites every day, often with precise time and location metadata. This rich source of data can improve our understanding of the globe. In this work, we propose a general framework that uses these publicly available images to construct dense maps of ground-level attributes from overhead imagery. In particular, we use well-defined probabilistic models and a weakly-supervised, multi-task training strategy to estimate the expected visual and auditory ground-level attributes: the types of scenes, objects, and sounds a person can experience at a location. Through a large-scale evaluation on real data, we show that our learned models can be used for applications including mapping, image localization, image retrieval, and metadata verification.
