Variational Autoencoder with CCA for Audio–Visual Cross-modal Retrieval

JIWEI ZHANG∗, Digital Content and Media Sciences Research Division, National Institute of Informatics, Japan
YI YU, Digital Content and Media Sciences Research Division, National Institute of Informatics, Japan
SUHUA TANG, Department of Computer and Network Engineering, Graduate School of Informatics and Engineering, The University of Electro-Communications, Japan
JIANMING WU, KDDI Research, Inc., Japan
WEI LI, School of Computer Science, Fudan University, China
