Automatic Tagging Using Deep Convolutional Neural Networks

We present a content-based automatic music tagging algorithm that uses fully convolutional neural networks (FCNs). We evaluate architectures consisting only of 2D convolutional and subsampling layers. In the experiments, we measure the AUC-ROC scores of architectures of varying complexity and input type on the MagnaTagATune dataset, where a 4-layer architecture with mel-spectrogram input achieves state-of-the-art performance. We further evaluate architectures with different numbers of layers on a larger dataset, the Million Song Dataset, and find that deeper models outperform the 4-layer architecture there. The experiments show that the mel-spectrogram is an effective time-frequency representation for automatic tagging and that more complex models benefit from more training data.
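As a rough illustration of the kind of architecture described above, 2D convolutions interleaved with subsampling only, the sketch below tracks how a mel-spectrogram input shrinks through successive pooling stages until a 1x1 feature map remains, from which per-tag scores can be read out. The input size (96 mel bins by 1366 frames) and the pooling factors are illustrative assumptions, not values taken from the abstract.

```python
# Sketch: spatial-size bookkeeping for a fully convolutional tagger.
# With "same"-padded convolutions, only the pooling factors change the
# feature-map size, so the whole stack can be summarized by its pools.

def shape_after_pooling(freq, time, pool_factors):
    """Return the (freq, time) shape after each (pf, pt) pooling stage."""
    shapes = []
    for pf, pt in pool_factors:
        freq, time = freq // pf, time // pt
        shapes.append((freq, time))
    return shapes

# Four assumed stages that reduce a 96x1366 mel-spectrogram to a 1x1 map,
# so a final layer can emit one score per tag without any dense layers.
stages = [(2, 4), (4, 5), (3, 8), (4, 8)]
print(shape_after_pooling(96, 1366, stages))
# -> [(48, 341), (12, 68), (4, 8), (1, 1)]
```

Because the network is fully convolutional, a deeper variant only needs extra conv/pool stages whose pooling factors still multiply out to the input size, which is one way to read the paper's comparison of 4-layer and deeper models.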
