Jointly Learning of Visual and Auditory: A New Approach for RS Image and Audio Cross-Modal Retrieval

Remote sensing (RS) images are widely used in civilian and military fields. As the volume of image data grows rapidly, fast and efficient RS image retrieval has become a challenging problem. Existing image retrieval methods, whether text-based or content-based, remain limited in practice: text input is inefficient, and a sample image for the query is often unavailable. Speech, by contrast, is a natural and convenient means of communication. This article therefore presents a novel speech-image cross-modal retrieval approach, named deep visual-audio network (DVAN), which learns a direct relationship between image and speech from paired image-audio data. The model has three main parts: 1) image feature extraction, which extracts effective features from RS images; 2) audio feature learning, which recognizes key information in the raw audio data, with AudioNet, a component of DVAN, proposed to obtain more discriminative features; and 3) multimodal embedding, which learns the direct correlations between the two modalities. Experimental results on an RS image-audio dataset demonstrate that the proposed method is effective and that speech-image retrieval is feasible, offering a faster and more convenient way to retrieve RS images.
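
The retrieval step that follows multimodal embedding can be pictured with a small sketch. This is not the DVAN implementation: it assumes the image and audio features have already been extracted, substitutes hypothetical fixed linear projections (`image_weights`, `audio_weights`) for the learned networks, and ranks by cosine similarity, a choice the abstract does not specify. All function names and values are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def project(features, weights):
    """Linear map into the shared space: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_images(audio_embedding, image_embeddings):
    """Return image indices ordered from most to least similar to the query."""
    scores = [cosine(audio_embedding, img) for img in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical projections standing in for the learned embedding layers
# (assumption: 3-dim image features, 2-dim audio features, 2-dim shared space).
image_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
audio_weights = [[1.0, 0.0], [0.0, 1.0]]

# Toy pre-extracted features; real features would come from the image and
# audio branches of the model.
image_features = [[1.0, 0.0, 5.0], [0.0, 1.0, 5.0], [1.0, 1.0, 0.0]]
audio_query = [0.9, 0.1]

image_embeddings = [project(f, image_weights) for f in image_features]
query_embedding = project(audio_query, audio_weights)

ranking = rank_images(query_embedding, image_embeddings)
print(ranking)  # [0, 2, 1]
```

In this toy setup the spoken query lies closest to the first image in the shared space, so that image is ranked first. In a trained model the projections would be learned from paired image-audio data so that matching pairs end up close together.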
