On the Robustness of Deep Convolutional Neural Networks for Music Classification

Deep neural networks (DNNs) have been successfully applied to music classification, including music tagging. However, there are several open questions regarding generalisation and best practices in the choice of network architectures, hyperparameters and input representations. In this article, we investigate specific aspects of neural networks to deepen our understanding of their properties. We analyse and (re-)validate a large music tagging dataset to investigate the reliability of training and evaluation. We perform comprehensive experiments involving audio preprocessing using different time-frequency representations, logarithmic magnitude compression, frequency weighting and scaling. Using a trained network, we compute label vector similarities, which are compared to ground-truth similarity. The results highlight several important aspects of music tagging and neural networks. We show that networks can be effective despite relatively large error rates in ground-truth datasets. We then show that many commonly used input preprocessing techniques are redundant, with the exception of magnitude compression. Lastly, the analysis of our trained network provides valuable insight into the relationships between music tags. These results highlight the benefit of using data-driven methods to address automatic music tagging.
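As an illustration of the logarithmic magnitude compression mentioned above, the following is a minimal sketch and not the preprocessing pipeline used in the experiments. It assumes a decibel mapping of the form 20 log10(max(|X|, eps)); the helper names `stft_magnitude` and `log_compress`, the floor `eps`, the sampling rate, and the frame and hop sizes are all hypothetical choices made for this example.

```python
import numpy as np


def stft_magnitude(signal, n_fft=512, hop=256):
    """Magnitude of a Hann-windowed short-time Fourier transform.

    Returns an array of shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


def log_compress(magnitude, eps=1e-10):
    """Map non-negative magnitudes to decibels, flooring at eps to avoid log(0)."""
    magnitude = np.asarray(magnitude, dtype=np.float64)
    return 20.0 * np.log10(np.maximum(magnitude, eps))


# Synthetic one-second test signal: a loud 440 Hz tone plus a quiet 3 kHz tone.
sr = 12000  # assumed sampling rate, for illustration only
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.01 * np.sin(2 * np.pi * 3000 * t)

mag = stft_magnitude(signal)  # linear magnitudes
db = log_compress(mag)        # compressed representation, same shape
```

On a linear scale the quiet component is two orders of magnitude smaller than the loud one; after compression the two differ by a fixed offset in decibels, which is one plausible reason such a mapping can matter for a network trained on these inputs.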
