Query by Video: Cross-modal Music Retrieval

Cross-modal retrieval learns the relationship between two types of data in a common embedding space so that an input from one modality can retrieve data from the other. We focus on modeling the relationship between two highly diverse data types: music and real-world videos. We learn cross-modal embeddings using a two-stream network trained on music-video pairs, where each branch takes one modality as input and is additionally constrained with emotion tags. These constraints allow the cross-modal embeddings to be learned from significantly fewer music-video pairs. To retrieve music for an input video, the trained model ranks the tracks in a music database by their cross-modal distances to the query video. Quantitative evaluations show high accuracy in audio/video emotion tagging when each branch is evaluated independently, as well as high performance in cross-modal music retrieval. We also present cross-modal music retrieval experiments on Spotify music using user-generated videos from Instagram and YouTube as queries, and subjective evaluations show that the proposed model retrieves relevant music. The music retrieval results are available at: http://www.ece.rochester.edu/~bli23/projects/query.html.
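To make the described pipeline concrete, below is a minimal sketch (not the authors' code) of a two-stream embedding network of this kind: each branch maps one modality into a shared space, an emotion-tag head constrains each branch, and retrieval ranks database tracks by embedding distance to the query video. The feature dimensions, layer sizes, loss form, and number of emotion tags are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128   # shared cross-modal embedding size (assumed)
NUM_TAGS = 8      # number of emotion tags (assumed)

class Branch(nn.Module):
    """One stream: pre-extracted modality features -> shared embedding,
    plus an emotion-tag classifier that constrains the embedding."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, EMBED_DIM),
        )
        self.tag_head = nn.Linear(EMBED_DIM, NUM_TAGS)

    def forward(self, x):
        z = F.normalize(self.encoder(x), dim=-1)  # unit-norm embedding
        return z, self.tag_head(z)                # embedding, tag logits

music_branch = Branch(in_dim=512)   # audio feature dimension (assumed)
video_branch = Branch(in_dim=1024)  # visual feature dimension (assumed)

def training_loss(audio_feat, video_feat, tags, margin=0.5):
    """Pull paired music/video embeddings together, push unpaired ones
    apart, and supervise both tag heads with shared emotion labels."""
    za, logits_a = music_branch(audio_feat)
    zv, logits_v = video_branch(video_feat)
    d = torch.cdist(zv, za)                       # all pairwise distances
    pos = d.diag()                                # matched pairs
    neg = F.relu(margin - d).fill_diagonal_(0.0)  # mismatched pairs
    contrastive = pos.mean() + neg.mean()
    tag_loss = (F.binary_cross_entropy_with_logits(logits_a, tags) +
                F.binary_cross_entropy_with_logits(logits_v, tags))
    return contrastive + tag_loss

def retrieve(query_video_feat, music_db_feats, top_k=5):
    """Rank tracks in the music database by cross-modal distance
    to the query video and return the closest indices."""
    with torch.no_grad():
        zv, _ = video_branch(query_video_feat)    # (1, EMBED_DIM)
        za, _ = music_branch(music_db_feats)      # (num_tracks, EMBED_DIM)
        dist = torch.cdist(zv, za)                # (1, num_tracks)
    return dist.argsort(dim=1)[:, :top_k]
```

Because the tag heads tie both branches to the same emotion vocabulary, the two embeddings are roughly aligned even before any pairing is seen, which is one plausible reading of why far fewer music-video pairs suffice for training.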
