Temporal Pooling and Multiscale Learning for Automatic Annotation and Ranking of Music Audio

This paper analyzes challenges in the automatic annotation and ranking of music audio and proposes three improvements. First, we motivate the use of principal component analysis on the mel-scaled spectrum. Second, we analyze how the choice of pooling functions used to summarize features over time affects performance, and show that combining several pooling functions improves the system. Finally, we introduce the idea of multiscale learning. By incorporating these ideas into our model, we obtain state-of-the-art performance on the Magnatagatune dataset.
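As an illustration of what combining several pooling functions can look like, the following minimal sketch summarizes a sequence of frame-level features into one fixed-length vector. The abstract does not say which pooling functions are used, so the choice of mean, max, min, and variance here is an assumption, as are the function name `pool_over_time` and the random placeholder features; this is not the paper's implementation.

```python
import numpy as np


def pool_over_time(frames: np.ndarray) -> np.ndarray:
    """Summarize a (n_frames, n_features) matrix into one fixed-length vector.

    The four pooling functions below are an assumed example set,
    not necessarily the ones evaluated in the paper.
    """
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (n_frames, n_features) array")
    pooled = [
        frames.mean(axis=0),  # average value of each feature over time
        frames.max(axis=0),   # strongest activation of each feature
        frames.min(axis=0),   # weakest activation of each feature
        frames.var(axis=0),   # variability of each feature over time
    ]
    # Concatenating the pooled statistics gives 4 * n_features values.
    return np.concatenate(pooled)


# Placeholder data standing in for frame-level features
# (for example, PCA-projected mel-spectrum frames).
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 8))
summary = pool_over_time(frames)
print(summary.shape)  # (32,)
```

The output length depends only on the number of features and pooling functions, not on the number of frames, so clips of different durations map to vectors of the same size.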
