Enriched Music Representations With Multiple Cross-Modal Contrastive Learning

Modeling the various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations from such sources, including the audio signal, interactions between users and songs, and associated genre metadata. Recently, contrastive learning has produced representations that generalize better than those from traditional supervised methods. In this paper, we present a novel approach that combines multiple types of music-related information using cross-modal contrastive learning, allowing us to learn audio representations from heterogeneous data simultaneously. We align the latent representations obtained from playlist-track interactions, genre metadata, and the tracks' audio by maximizing the agreement between these modality representations with a contrastive loss. We evaluate our approach on three tasks: genre classification, playlist continuation, and automatic tagging. We compare its performance with that of a baseline audio-based CNN trained to predict these modalities, and we also study the importance of including multiple sources of information when training our embedding model. The results suggest that the proposed method outperforms the baseline on all three downstream tasks and achieves performance comparable to the state of the art.
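To make the alignment objective concrete, below is a minimal PyTorch sketch of one plausible instantiation of the cross-modal contrastive loss. The symmetric NT-Xent form, the temperature value, and the encoder output names (audio_emb, playlist_emb, genre_emb) are assumptions for illustration, not the paper's exact formulation; the core idea is that paired modality embeddings of the same track attract while in-batch mismatches repel.

```python
import torch
import torch.nn.functional as F

def nt_xent(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Symmetric NT-Xent loss between two batches of paired embeddings.

    Row i of z_a and row i of z_b are two modality views of track i;
    all other rows in the batch serve as negatives.
    """
    z_a = F.normalize(z_a, dim=1)            # project embeddings to the unit sphere
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature     # scaled pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    # cross-entropy in both directions (a -> b and b -> a)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical batch of 128-d embeddings from three encoders (names assumed):
audio_emb = torch.randn(32, 128)     # audio CNN output
playlist_emb = torch.randn(32, 128)  # playlist-track interaction encoder
genre_emb = torch.randn(32, 128)     # genre-metadata encoder

# Align audio with each context modality by summing the pairwise losses.
loss = nt_xent(audio_emb, playlist_emb) + nt_xent(audio_emb, genre_emb)
```

Summing pairwise audio-vs-modality terms keeps the audio encoder as the anchor, so at inference time an embedding can be computed from audio alone while still reflecting playlist and genre context.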
