Cross-modal Embeddings for Video and Audio Retrieval

The increasing amount of video available online brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to handle this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given audio query. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings at both scales and assess their quality in a retrieval problem, formulated as using the features extracted from one modality to retrieve the most similar videos based on the features computed in the other modality.
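To make the approach more concrete, the sketch below shows a two-branch network that projects precomputed visual and audio features into a shared embedding space and a Recall@K routine for the cross-modal retrieval evaluation. This is a minimal sketch, assuming precomputed per-video features such as those released with YouTube-8M; the branch sizes, the hinge-based contrastive loss, and the `CrossModalEmbedding` / `recall_at_k` helpers are illustrative assumptions (written in PyTorch), not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalEmbedding(nn.Module):
    """Two-branch network mapping visual and audio features into a
    common embedding space (dimensions are illustrative assumptions)."""

    def __init__(self, visual_dim=1024, audio_dim=128, joint_dim=256):
        super().__init__()
        self.visual_branch = nn.Sequential(
            nn.Linear(visual_dim, 512), nn.ReLU(),
            nn.Linear(512, joint_dim),
        )
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, joint_dim),
        )

    def forward(self, visual_feats, audio_feats):
        # L2-normalise so cosine similarity reduces to a dot product.
        v = F.normalize(self.visual_branch(visual_feats), dim=-1)
        a = F.normalize(self.audio_branch(audio_feats), dim=-1)
        return v, a


def contrastive_loss(v, a, margin=0.2):
    """Hinge-style loss: the audio and visual embeddings of the same
    video should be closer than embeddings of different videos."""
    sims = v @ a.t()                                  # (batch, batch) similarities
    pos = sims.diag().unsqueeze(1)                    # matching audio-visual pairs
    cost_a = (margin + sims - pos).clamp(min=0)       # video query vs. wrong audio
    cost_v = (margin + sims - pos.t()).clamp(min=0)   # audio query vs. wrong video
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return (cost_a.masked_fill(mask, 0).mean()
            + cost_v.masked_fill(mask, 0).mean())


def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Recall@K for cross-modal retrieval: query with one modality and
    rank the gallery of the other; the i-th query's true match is
    assumed to be the i-th gallery item."""
    sims = query_emb @ gallery_emb.t()
    ranks = sims.argsort(dim=1, descending=True)
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    hit_pos = (ranks == targets).float().argmax(dim=1)  # rank of the true match
    return {k: (hit_pos < k).float().mean().item() for k in ks}
```

In the retrieval setup described above, the embeddings of one modality serve as queries and those of the other modality form the gallery, and Recall@K reports how often the correct counterpart appears among the top K ranked results; the same routine works in both directions (video-to-audio and audio-to-video).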
