Query by Video: Cross-modal Music Retrieval

Cross-modal retrieval learns the relationship between two types of data in a common embedding space so that an input from one modality can retrieve data from the other. We focus on modeling the relationship between two highly diverse data types: music and real-world videos. We learn cross-modal embeddings using a two-stream network trained on music-video pairs, where each branch takes one modality as input and is additionally constrained with emotion tags. These constraints allow the cross-modal embeddings to be learned from significantly fewer music-video pairs. To retrieve music for an input video, the trained model ranks the tracks in a music database by their cross-modal distances to the query video. Quantitative evaluations show high accuracy in audio/video emotion tagging when each branch is evaluated independently, as well as high performance in cross-modal music retrieval. We also present cross-modal music retrieval experiments on Spotify music using user-generated videos from Instagram and YouTube as queries, and subjective evaluations show that the proposed model retrieves relevant music. The music retrieval results are available at: http://www.ece.rochester.edu/~bli23/projects/query.html.
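To make the described pipeline concrete, below is a minimal sketch (not the authors' code) of a two-stream embedding network of this kind: each branch maps one modality into a shared space, an emotion-tag head constrains each branch, and retrieval ranks database tracks by embedding distance to the query video. The feature dimensions, layer sizes, loss form, and number of emotion tags are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128   # shared cross-modal embedding size (assumed)
NUM_TAGS = 8      # number of emotion tags (assumed)

class Branch(nn.Module):
    """One stream: pre-extracted modality features -> shared embedding,
    plus an emotion-tag classifier that constrains the embedding."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, EMBED_DIM),
        )
        self.tag_head = nn.Linear(EMBED_DIM, NUM_TAGS)

    def forward(self, x):
        z = F.normalize(self.encoder(x), dim=-1)  # unit-norm embedding
        return z, self.tag_head(z)                # embedding, tag logits

music_branch = Branch(in_dim=512)   # audio feature dimension (assumed)
video_branch = Branch(in_dim=1024)  # visual feature dimension (assumed)

def training_loss(audio_feat, video_feat, tags, margin=0.5):
    """Pull paired music/video embeddings together, push unpaired ones
    apart, and supervise both tag heads with shared emotion labels."""
    za, logits_a = music_branch(audio_feat)
    zv, logits_v = video_branch(video_feat)
    d = torch.cdist(zv, za)                       # all pairwise distances
    pos = d.diag()                                # matched pairs
    neg = F.relu(margin - d).fill_diagonal_(0.0)  # mismatched pairs
    contrastive = pos.mean() + neg.mean()
    tag_loss = (F.binary_cross_entropy_with_logits(logits_a, tags) +
                F.binary_cross_entropy_with_logits(logits_v, tags))
    return contrastive + tag_loss

def retrieve(query_video_feat, music_db_feats, top_k=5):
    """Rank tracks in the music database by cross-modal distance
    to the query video and return the closest indices."""
    with torch.no_grad():
        zv, _ = video_branch(query_video_feat)    # (1, EMBED_DIM)
        za, _ = music_branch(music_db_feats)      # (num_tracks, EMBED_DIM)
        dist = torch.cdist(zv, za)                # (1, num_tracks)
    return dist.argsort(dim=1)[:, :top_k]
```

Because the tag heads tie both branches to the same emotion vocabulary, the two embeddings are roughly aligned even before any pairing is seen, which is one plausible reading of why far fewer music-video pairs suffice for training.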
