Towards real-time music auto-tagging using sparse features

Unsupervised feature learning algorithms such as sparse coding and deep belief networks have been shown to be a viable alternative to hand-crafted feature design for music information retrieval. Nevertheless, such algorithms are usually computationally expensive. This paper investigates techniques for accelerating sparse feature extraction and music classification. To study the trade-off between computational efficiency and accuracy, we compare state-of-the-art dense audio features with sparse features computed using 1) sparse coding with a random dictionary, 2) randomized clustering forests, and 3) an extension of randomized clustering forests to temporal signals. For classifier training and prediction, we compare support vector machines with linear and non-linear kernel functions. We evaluate on music auto-tagging over 140 genre/style tags using a subset of 7,799 songs from the CAL10k data set. Our results show an 11-fold speed-up with a 3.45% accuracy loss compared to dense features. With the proposed sparse features, feature extraction and auto-tagging can be completed within 1 second per song, at a tagging accuracy of 0.1302 in mean average precision.
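To make the first technique concrete, below is a minimal Python/scikit-learn sketch of sparse feature extraction with a random (unlearned) dictionary, where fixing the dictionary in advance removes the cost of dictionary learning. The feature dimensions, the OMP solver, the sparsity level, and the max-pooling step are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40-dim frame-level descriptors (e.g. MFCC-like
# features), a 512-atom dictionary, and 1000 frames for one song.
n_features, n_atoms, n_frames = 40, 512, 1000

# Random dictionary: atoms drawn i.i.d. Gaussian, then L2-normalized.
# Skipping dictionary learning is the source of the speed-up.
D = rng.standard_normal((n_atoms, n_features))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Stand-in for real audio frames; in practice these would come from a
# short-time feature extractor.
X = rng.standard_normal((n_frames, n_features))

# Sparse-code each frame against the fixed random dictionary.
# OMP with 10 non-zero coefficients is an assumed setting.
coder = SparseCoder(dictionary=D,
                    transform_algorithm="omp",
                    transform_n_nonzero_coefs=10)
codes = coder.transform(X)            # shape: (n_frames, n_atoms)

# Pool frame-level codes into one song-level vector for the classifier
# (max pooling over absolute activations; the pooling choice is an
# assumption as well).
song_vector = np.abs(codes).max(axis=0)
print(song_vector.shape)              # (512,)
```

The resulting song-level vector would then be fed to a linear SVM, which keeps prediction cheap relative to non-linear kernels.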
