Paper Information - Jointly Learning of Visual and Auditory: A New Approach for RS Image and Audio Cross-Modal Retrieval

Remote sensing (RS) images are widely used in civilian and military fields. As the volume of image data grows rapidly, achieving fast and efficient RS image retrieval has become a challenging problem. However, existing image retrieval methods, whether text-based or content-based, remain limited in practice; for example, text input is inefficient, and a sample image for the query is often unavailable. Speech, by contrast, is a natural and convenient means of communication. Therefore, this article presents a novel speech-image cross-modal retrieval approach, named deep visual-audio network (DVAN), which establishes a direct relationship between image and speech from paired image-audio data. The model has three main parts: 1) image feature extraction, which extracts effective features from RS images; 2) audio feature learning, which recognizes key information in the raw data, and for which AudioNet, a part of DVAN, is proposed to obtain more discriminative features; and 3) multimodal embedding, which learns the direct correlations between the two modalities. Experimental results on an RS image-audio dataset demonstrate that the proposed method is effective and that speech-image retrieval is feasible, providing a new way to retrieve RS images faster and more conveniently.
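As a rough illustration of the retrieval step only, the sketch below ranks gallery images against a spoken query by cosine similarity in a shared embedding space. It is not the DVAN implementation: the function names (`l2_normalize`, `retrieve`), the 8-dimensional space, and the synthetic embeddings are all assumptions made for this example, and the trained image and audio branches are replaced by random vectors.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each vector to unit length so a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def retrieve(query_emb, gallery_emb, top_k=3):
    """Return indices and scores of the top_k gallery items most similar to the query."""
    scores = l2_normalize(gallery_emb) @ l2_normalize(query_emb)
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]


# Synthetic stand-ins (assumption): in a real system these vectors would be
# produced by a trained image branch and a trained audio branch.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(5, 8))  # 5 gallery images in an 8-D joint space

# Hypothetical spoken query paired with image 2: its embedding is modeled
# as that image's embedding plus a small perturbation.
spoken_query_emb = image_emb[2] + 0.05 * rng.normal(size=8)

top_idx, top_scores = retrieve(spoken_query_emb, image_emb, top_k=3)
print("ranked image indices:", top_idx)
print("cosine similarities:", top_scores)
```

Normalizing both sides keeps the ranking independent of embedding magnitude, which is a common choice for joint embedding spaces; whether the paper uses cosine similarity or another distance is not stated in the abstract.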
