Image2song: Song Retrieval via Bridging Image Content and Lyric Words

Images are usually taken to express some emotion or purpose, such as love or celebrating Christmas. Combining an image with a relevant song can amplify that expression, a practice that has recently drawn much attention on social networks. Automatic song selection is therefore desirable. In this paper, we propose to retrieve semantically relevant songs from an image query alone, which we name the image2song problem. Motivated by the need to establish correlation at the semantic/content level, we build a semantic-based song retrieval framework that learns the correlation between image content and lyric words. The model uses a convolutional neural network to generate rich tags from image regions and a recurrent neural network to model lyrics, and then establishes the correlation via a multi-layer perceptron. To reduce the content gap between image and lyric, we make the lyric modeling focus on the main image content via tag attention. To study the proposed problem, we collect a dataset of (image, music clip, lyric) triplets from social-sharing multimodal data. We demonstrate that the proposed model achieves noticeable results on the image2song retrieval task and provides suitable songs. In addition, we perform the song2image task.
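The retrieval pipeline described above (image tags, lyric states, tag attention, and a multi-layer perceptron score) can be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the paper's implementation: the tag vectors stand in for the CNN tagger's output, the lyric states stand in for RNN hidden states, the dot-product form of the attention is assumed, and all function names, dimensions, and weights are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(vectors):
    # Average the tag vectors into one image representation (assumed pooling).
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def tag_attention(tag_vec, lyric_states):
    # Weight each lyric state by its similarity to the image tags
    # (dot-product attention is an assumption, not the paper's exact form).
    weights = softmax([dot(tag_vec, h) for h in lyric_states])
    dim = len(lyric_states[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, lyric_states))
              for i in range(dim)]
    return pooled, weights

def mlp_score(x, w1, b1, w2, b2):
    # One hidden ReLU layer followed by a sigmoid relevance score.
    hidden = [max(0.0, dot(row, x) + b) for row, b in zip(w1, b1)]
    logit = dot(w2, hidden) + b2
    return 1.0 / (1.0 + math.exp(-logit))

def rank_songs(tag_vecs, songs, w1, b1, w2, b2):
    # Score every song against the image and return names, best first.
    image_vec = mean_vector(tag_vecs)
    scored = []
    for name, states in songs.items():
        lyric_vec, _ = tag_attention(image_vec, states)
        scored.append((mlp_score(image_vec + lyric_vec, w1, b1, w2, b2), name))
    return [name for _, name in sorted(scored, reverse=True)]

# Toy data: hypothetical 2-d vectors and hand-picked (untrained) weights.
tag_vecs = [[1.0, 0.0], [1.0, 0.0]]
songs = {
    "song_a": [[2.0, 0.0], [0.0, 2.0]],  # one lyric state aligned with the tags
    "song_b": [[0.0, 1.0], [0.0, 3.0]],  # no lyric state aligned with the tags
}
w1, b1, w2, b2 = [[0.0, 0.0, 1.0, 0.0]], [0.0], [1.0], 0.0
ranking = rank_songs(tag_vecs, songs, w1, b1, w2, b2)
```

In this toy setup the attention concentrates on the lyric state that matches the image tags, so the song whose lyrics share content with the image is ranked first. In a real system the weights would be learned from the (image, music clip, lyric) triplets rather than set by hand.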
