Cross-modal Embeddings for Video and Audio Retrieval

The increasing amount of video available online brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to handle this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given audio query. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings at both scales and assess their quality in a retrieval problem, formulated as using the features extracted from one modality to retrieve the most similar videos based on the features computed in the other modality.
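To make the approach more concrete, the sketch below shows a two-branch network that projects precomputed visual and audio features into a shared embedding space and a Recall@K routine for the cross-modal retrieval evaluation. This is a minimal sketch, assuming precomputed per-video features such as those released with YouTube-8M; the branch sizes, the hinge-based contrastive loss, and the `CrossModalEmbedding` / `recall_at_k` helpers are illustrative assumptions (written in PyTorch), not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalEmbedding(nn.Module):
    """Two-branch network mapping visual and audio features into a
    common embedding space (dimensions are illustrative assumptions)."""

    def __init__(self, visual_dim=1024, audio_dim=128, joint_dim=256):
        super().__init__()
        self.visual_branch = nn.Sequential(
            nn.Linear(visual_dim, 512), nn.ReLU(),
            nn.Linear(512, joint_dim),
        )
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, joint_dim),
        )

    def forward(self, visual_feats, audio_feats):
        # L2-normalise so cosine similarity reduces to a dot product.
        v = F.normalize(self.visual_branch(visual_feats), dim=-1)
        a = F.normalize(self.audio_branch(audio_feats), dim=-1)
        return v, a


def contrastive_loss(v, a, margin=0.2):
    """Hinge-style loss: the audio and visual embeddings of the same
    video should be closer than embeddings of different videos."""
    sims = v @ a.t()                                  # (batch, batch) similarities
    pos = sims.diag().unsqueeze(1)                    # matching audio-visual pairs
    cost_a = (margin + sims - pos).clamp(min=0)       # video query vs. wrong audio
    cost_v = (margin + sims - pos.t()).clamp(min=0)   # audio query vs. wrong video
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return (cost_a.masked_fill(mask, 0).mean()
            + cost_v.masked_fill(mask, 0).mean())


def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Recall@K for cross-modal retrieval: query with one modality and
    rank the gallery of the other; the i-th query's true match is
    assumed to be the i-th gallery item."""
    sims = query_emb @ gallery_emb.t()
    ranks = sims.argsort(dim=1, descending=True)
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    hit_pos = (ranks == targets).float().argmax(dim=1)  # rank of the true match
    return {k: (hit_pos < k).float().mean().item() for k in ks}
```

In the retrieval setup described above, the embeddings of one modality serve as queries and those of the other modality form the gallery, and Recall@K reports how often the correct counterpart appears among the top K ranked results; the same routine works in both directions (video-to-audio and audio-to-video).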
