Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. Although powerful models and efficient algorithms achieve considerable performance on scene recognition from overhead images alone, visual recognition still suffers from variation in ground objects, lighting conditions, etc. Inspired by the multi-channel perception theory in cognitive science, in this paper we explore a novel audiovisual aerial scene recognition task that uses both images and sounds as input. Based on the observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit knowledge of sound events to improve aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting audio information for aerial scene recognition. The source code is publicly available for reproducibility purposes.
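The abstract does not spell out the three transfer approaches, so the following is only a minimal sketch of one generic way sound event knowledge could be carried into scene recognition: a scene classification loss is combined with a penalty that keeps the model's sound event predictions close to those of a pretrained sound event model. The function names, the `weight` parameter, and the choice of a KL-divergence penalty are all assumptions made for illustration, not the authors' method.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    # Negative log-likelihood of the true scene class.
    return -math.log(probs[label])

def kl_divergence(p, q):
    # KL(p || q); terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def transfer_loss(scene_logits, scene_label, event_logits,
                  teacher_event_probs, weight=0.5):
    # Hypothetical combined objective (an assumption, not from the paper):
    # scene cross-entropy plus a weighted KL term that penalizes drifting
    # away from a pretrained sound event model's predictions.
    scene_probs = softmax(scene_logits)
    event_probs = softmax(event_logits)
    return (cross_entropy(scene_probs, scene_label)
            + weight * kl_divergence(teacher_event_probs, event_probs))
```

When the model's sound event distribution matches the pretrained one, the penalty vanishes and only the scene loss remains; any mismatch raises the loss in proportion to `weight`.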
