Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging

Music auto-tagging is often handled in a manner similar to image classification, by regarding the two-dimensional audio spectrogram as image data. However, music auto-tagging differs from image classification in that its tags are highly diverse and have different levels of abstraction. To address this, we propose a convolutional neural network (CNN)-based architecture that embraces multi-level and multi-scale features. The architecture is trained in three steps. First, we conduct supervised feature learning to capture local audio features using a set of CNNs with different input sizes. Second, we extract audio features from each layer of the pretrained convolutional networks separately and aggregate them over a long audio clip. Finally, we feed the aggregated features into fully connected networks to make the final tag predictions. Our experiments show that combining multi-level and multi-scale features is highly effective for music auto-tagging, and the proposed method outperforms previous state-of-the-art methods on the MagnaTagATune dataset and the Million Song Dataset. We further show that the proposed architecture is useful for transfer learning.
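The aggregation step above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the network names, layer counts, and feature dimensions are hypothetical, and the per-layer feature maps are stand-ins for activations produced by pretrained CNNs with different input sizes. It shows only the core idea of pooling each layer's activations over the whole clip and concatenating across layers and scales into one clip-level feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-layer feature maps from three pretrained CNNs, each trained
# on a different input (segment) size. Each array has shape (time_frames, channels);
# coarser scales yield fewer time frames. All names and sizes are illustrative.
feature_maps = {
    "cnn_small":  [rng.standard_normal((120, 64)), rng.standard_normal((60, 128))],
    "cnn_medium": [rng.standard_normal((60, 64)),  rng.standard_normal((30, 128))],
    "cnn_large":  [rng.standard_normal((30, 64)),  rng.standard_normal((15, 128))],
}

def aggregate(maps):
    """Average-pool each layer's activations over time (the long audio clip),
    then concatenate the pooled vectors across all layers and scales."""
    pooled = [fmap.mean(axis=0)            # (time, channels) -> (channels,)
              for layer_maps in maps.values()
              for fmap in layer_maps]
    return np.concatenate(pooled)

clip_vector = aggregate(feature_maps)
print(clip_vector.shape)  # (576,) = 3 scales * (64 + 128) channels
```

The resulting fixed-length vector is what would then be fed into the fully connected networks for the final tag predictions.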
