MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers

Music annotation has long been a central topic in Music Information Retrieval (MIR). Traditional models rely on supervised learning for music annotation tasks. However, as supervised machine learning approaches grow in complexity, their demand for annotated training data increasingly outstrips what is available. Moreover, over-reliance on labeled data when training supervised models can lead to unexpected results and open vulnerabilities to adversarial attacks. In this paper, a new self-supervised music-acoustic representation learning approach named MusiCoder is proposed. Inspired by the success of BERT, MusiCoder builds on an architecture of bidirectional self-attention transformers. Two pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to adapt BERT-like masked-reconstruction pre-training to the continuous acoustic frame domain. The performance of MusiCoder is evaluated on two downstream music annotation tasks. The results show that MusiCoder outperforms state-of-the-art models on both music genre classification and auto-tagging. The effectiveness of MusiCoder points to a promising new self-supervised recipe for understanding music: first pre-train a transformer-based model on massive unlabeled music audio with masked-reconstruction tasks, then fine-tune it on specific downstream tasks with labeled data.
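To make the two objectives concrete, the following is a minimal NumPy sketch of what CFM and CCM could look like on a log-mel spectrogram. The span lengths, number of spans, and the zero-fill strategy here are illustrative assumptions, not the paper's reported settings (BERT-style masking typically mixes zeroing, random replacement, and leaving positions unchanged), and the reconstruction loss shown is a stand-in for whatever the model actually optimizes over masked positions.

```python
import numpy as np

def contiguous_frames_masking(spec, num_spans=2, span_len=7, rng=None):
    """CFM sketch: zero out random contiguous blocks of time frames.

    spec: (num_frames, num_channels) log-mel spectrogram.
    Returns the masked spectrogram and a boolean mask marking
    the frame positions the model must reconstruct.
    """
    rng = rng or np.random.default_rng()
    masked = spec.copy()
    mask = np.zeros(spec.shape[0], dtype=bool)
    for _ in range(num_spans):
        start = rng.integers(0, spec.shape[0] - span_len)
        masked[start:start + span_len, :] = 0.0  # assumed zero-fill
        mask[start:start + span_len] = True
    return masked, mask

def contiguous_channels_masking(spec, num_spans=1, span_len=8, rng=None):
    """CCM sketch: zero out random contiguous blocks of frequency
    channels, analogous to SpecAugment's frequency masking."""
    rng = rng or np.random.default_rng()
    masked = spec.copy()
    mask = np.zeros(spec.shape[1], dtype=bool)
    for _ in range(num_spans):
        start = rng.integers(0, spec.shape[1] - span_len)
        masked[:, start:start + span_len] = 0.0  # assumed zero-fill
        mask[start:start + span_len] = True
    return masked, mask

# Pre-training then scores reconstruction only at masked positions.
spec = np.random.rand(400, 128).astype(np.float32)  # fake 400-frame, 128-mel clip
x, frame_mask = contiguous_frames_masking(spec)
x, chan_mask = contiguous_channels_masking(x)
reconstruction = x  # placeholder for the transformer encoder's output
l1_loss = np.abs(reconstruction[frame_mask] - spec[frame_mask]).mean()
```

In this sketch the two masks are applied independently and composed on the same input, so a single pre-training example can carry both time-span and channel-span corruption; whether the paper combines them per example or alternates between them is not specified here.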
