Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging

Music auto-tagging is often handled in a manner similar to image classification, by regarding the two-dimensional audio spectrogram as image data. However, music auto-tagging differs from image classification in that its tags are highly diverse and have different levels of abstraction. To address this, we propose a convolutional neural network (CNN)-based architecture that embraces multi-level and multi-scale features. The architecture is trained in three steps. First, we conduct supervised feature learning to capture local audio features using a set of CNNs with different input sizes. Second, we extract audio features from each layer of the pretrained convolutional networks separately and aggregate them over a long audio clip. Finally, we feed the aggregated features into fully connected networks to make the final tag predictions. Our experiments show that combining multi-level and multi-scale features is highly effective for music auto-tagging, and the proposed method outperforms previous state-of-the-art methods on the MagnaTagATune dataset and the Million Song Dataset. We further show that the proposed architecture is useful for transfer learning.
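The aggregation step above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the network names, layer counts, and feature dimensions are hypothetical, and the per-layer feature maps are stand-ins for activations produced by pretrained CNNs with different input sizes. It shows only the core idea of pooling each layer's activations over the whole clip and concatenating across layers and scales into one clip-level feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-layer feature maps from three pretrained CNNs, each trained
# on a different input (segment) size. Each array has shape (time_frames, channels);
# coarser scales yield fewer time frames. All names and sizes are illustrative.
feature_maps = {
    "cnn_small":  [rng.standard_normal((120, 64)), rng.standard_normal((60, 128))],
    "cnn_medium": [rng.standard_normal((60, 64)),  rng.standard_normal((30, 128))],
    "cnn_large":  [rng.standard_normal((30, 64)),  rng.standard_normal((15, 128))],
}

def aggregate(maps):
    """Average-pool each layer's activations over time (the long audio clip),
    then concatenate the pooled vectors across all layers and scales."""
    pooled = [fmap.mean(axis=0)            # (time, channels) -> (channels,)
              for layer_maps in maps.values()
              for fmap in layer_maps]
    return np.concatenate(pooled)

clip_vector = aggregate(feature_maps)
print(clip_vector.shape)  # (576,) = 3 scales * (64 + 128) channels
```

The resulting fixed-length vector is what would then be fed into the fully connected networks for the final tag predictions.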
