A Bag of Systems Representation for Music Auto-Tagging

We present a content-based automatic tagging system for music that relies on a high-level, concise “Bag of Systems” (BoS) representation of the characteristics of a musical piece. The BoS representation leverages a rich dictionary of musical codewords, where each codeword is a generative model that captures timbral and temporal characteristics of music. Songs are represented as a BoS histogram over codewords, which allows for the use of traditional algorithms for text document retrieval to perform auto-tagging. Compared to estimating a single generative model to directly capture the musical characteristics of songs associated with a tag, the BoS approach offers the flexibility to combine different generative models at various time resolutions through the selection of the BoS codewords. Additionally, decoupling the modeling of audio characteristics from the modeling of tag-specific patterns makes BoS a more robust and rich representation of music. Experiments show that this leads to superior auto-tagging performance.

[1]  Nima Mesgarani,et al.  Discrimination of speech from nonspeech based on multiscale spectro-temporal Modulations , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Stefano Soatto,et al.  Dynamic Textures , 2003, International Journal of Computer Vision.

[3]  Perry R. Cook,et al.  Easy As CBA: A Simple Probabilistic Model for Tagging Music , 2009, ISMIR.

[4]  Nuno Vasconcelos,et al.  Modeling, Clustering, and Segmenting Video with Mixtures of Dynamic Textures , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Zhouyu Fu,et al.  Music classification via the bag-of-features approach , 2011, Pattern Recognit. Lett..

[6]  Michael J. Swain,et al.  Color indexing , 1991, International Journal of Computer Vision.

[7]  Gaël Richard,et al.  Temporal Integration for Audio Classification With Application to Musical Instrument Classification , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  Bob L. Sturm,et al.  Musical instrument identification using multiscale Mel-frequency cepstral coefficients , 2010, 2010 18th European Signal Processing Conference.

[9]  Ashutosh Kumar Singh,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2010 .

[10]  George Tzanetakis,et al.  Improving automatic music tag annotation using stacked generalization of probabilistic SVM outputs , 2009, ACM Multimedia.

[11]  Youngmoo E. Kim,et al.  Exploring automatic music annotation with "acoustically-objective" tags , 2010, MIR '10.

[12]  D. Aldous Exchangeability and related topics , 1985 .

[13]  Douglas E. Sturim,et al.  Speaker indexing in large audio databases using anchor models , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[14]  Antoni B. Chan,et al.  Growing a bag of systems tree for fast and accurate classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Gert R. G. Lanckriet,et al.  Semantic Annotation and Retrieval of Music using a Bag of Systems Representation , 2011, ISMIR.

[16]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[17]  Honglak Lee,et al.  Unsupervised feature learning for audio classification using convolutional deep belief networks , 2009, NIPS.

[18]  Mark B. Sandler,et al.  A Semantic Space for Music Derived from Social Tags , 2007, ISMIR.

[19]  Dacheng Tao,et al.  Music Tagging with Regularized Logistic Regression , 2011, ISMIR.

[20]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[21]  Antoni B. Chan,et al.  Clustering dynamic textures with the hierarchical EM algorithm , 2013, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[22]  George Tzanetakis,et al.  Musical genre classification of audio signals , 2002, IEEE Trans. Speech Audio Process..

[23]  Perry R. Cook,et al.  Content-Based Musical Similarity Computation using the Hierarchical Dirichlet Process , 2008, ISMIR.

[24]  Daniel P. W. Ellis,et al.  Multiple-Instance Learning for Music Information Retrieval , 2008, ISMIR.

[25]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[26]  Gerhard Widmer,et al.  Improvements of Audio-Based Music Similarity and Genre Classificaton , 2005, ISMIR.

[27]  Nuno Vasconcelos,et al.  Learning Mixture Hierarchies , 1998, NIPS.

[28]  Mathieu Lagrange,et al.  Multi-scale temporal fusion by boosting for music classification , 2011, ISMIR.

[29]  Gert R. G. Lanckriet,et al.  Semantic Annotation and Retrieval of Music and Sound Effects , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[30]  Antoni B. Chan,et al.  Time Series Models for Semantic Music Annotation , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[31]  Douglas Eck,et al.  Temporal Pooling and Multiscale Learning for Automatic Annotation and Ranking of Music Audio , 2011, ISMIR.

[32]  Chin-Hui Lee,et al.  A Study on Music Genre Classification Based on Universal Acoustic Models , 2006, ISMIR.

[33]  Constantine Kotropoulos,et al.  Non-Negative Multilinear Principal Component Analysis of Auditory Temporal Modulations for Music Genre Classification , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[34]  Tony Jebara,et al.  Probability Product Kernels , 2004, J. Mach. Learn. Res..

[35]  Malcolm Slaney,et al.  Analysis of Minimum Distances in High-Dimensional Musical Spaces , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[36]  Markus Schedl,et al.  Hybrid retrieval approaches to geospatial music recommendation , 2013, SIGIR.

[37]  José Paulo Leal,et al.  Combining usage and content in an online music recommendation system for music in the long-tail , 2012, WWW.

[38]  Thierry Bertin-Mahieux,et al.  Automatic Generation of Social Tags for Music Recommendation , 2007, NIPS.

[39]  Joakim Andén,et al.  Multiscale Scattering for Audio Classification , 2011, ISMIR.

[40]  Antoni B. Chan,et al.  The variational hierarchical EM algorithm for clustering hidden Markov models , 2012, NIPS.