Multivariate Autoregressive Mixture Models for Music Auto-Tagging

We propose the multivariate autoregressive model for content-based music auto-tagging. At the song level, our approach leverages the multivariate autoregressive mixture (ARM) model, a generative time-series model for audio, which assumes that each feature vector in an audio fragment is a linear function of previous feature vectors. To tackle tag-model estimation, we propose an efficient hierarchical EM algorithm for ARMs (HEM-ARM), which summarizes the acoustic information common to the ARMs modeling the individual songs associated with a tag. We compare the ARM model with the recently proposed dynamic texture mixture (DTM) model. This comparison lets us investigate the relative merits of different modeling choices for music time series: i) the flexibility of selecting a higher memory order in the ARM, ii) the capability of the DTM to learn a specific frequency basis for each particular tag, and iii) the effect of the hidden layer of the dynamic texture (DT) versus the time efficiency of learning and inference with fully observable AR components. Finally, we experiment with a support vector machine (SVM) approach that classifies songs based on a kernel calculated on the frequency responses of the corresponding song ARMs. We show that the proposed approach outperforms SVMs trained on a different kernel function that is based on a competing generative model.
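As an illustration of the assumption that each feature vector is a linear function of previous feature vectors, the following is a minimal sketch of a single multivariate AR component of memory order p, fitted by ordinary least squares. It is not the HEM-ARM algorithm and it does not model the mixture. The function names (`simulate_var`, `fit_var`), the coefficient values, and the Gaussian noise model are assumptions made for this example only.

```python
import numpy as np

def simulate_var(A, T, noise_std=0.1, seed=0):
    """Draw T vectors from x_t = sum_j A[j-1] @ x_{t-j} + noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    p, d = len(A), A[0].shape[0]
    X = np.zeros((T + p, d))
    for t in range(p, T + p):
        # Each new vector is a linear function of the p previous ones.
        X[t] = sum(A[j] @ X[t - 1 - j] for j in range(p))
        X[t] += noise_std * rng.standard_normal(d)
    return X[p:]

def fit_var(X, p):
    """Least-squares estimate of the AR matrices and residual covariance."""
    T, d = X.shape
    # Row for time t holds [x_{t-1}, ..., x_{t-p}] side by side.
    Z = np.hstack([X[p - j:T - j] for j in range(1, p + 1)])
    Y = X[p:]
    W, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    # Y = Z @ W, so the j-th coefficient matrix is the transpose of block j.
    A = [W[j * d:(j + 1) * d].T for j in range(p)]
    resid = Y - Z @ W
    Q = resid.T @ resid / (T - p)
    return A, Q

# Assumed ground-truth coefficients for a stable 2-dimensional, order-2 process.
A_true = [np.array([[0.5, 0.1], [0.0, 0.3]]),
          np.array([[-0.2, 0.0], [0.1, -0.1]])]
X = simulate_var(A_true, T=5000)
A_hat, Q_hat = fit_var(X, p=2)
```

Because every variable in this model is observed, estimation reduces to a linear regression, which is the kind of efficiency the abstract contrasts with the hidden layer of the DT. Raising `p` corresponds to selecting a higher memory order.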
