Time Series Models for Semantic Music Annotation

Many state-of-the-art systems for automatic music tagging model music with bag-of-features representations, which give little or no account of temporal dynamics, a key characteristic of the audio signal. We describe a novel approach to automatic music annotation and retrieval that captures temporal (e.g., rhythmic) aspects as well as timbral content. The proposed approach leverages a recently proposed song model based on a generative time series model of the musical content, the dynamic texture mixture (DTM) model, which treats fragments of audio as the output of a linear dynamical system. To model characteristic temporal dynamics and timbral content at the tag level, a novel, efficient hierarchical expectation-maximization (EM) algorithm for DTMs (HEM-DTM) summarizes the common information shared by the DTMs that model the individual songs associated with each tag. Experiments show that learning the semantics of music benefits from modeling temporal dynamics.
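To make the underlying song model concrete, the following sketch simulates a single dynamic texture, i.e., a linear dynamical system whose hidden state captures temporal dynamics and whose observation map captures timbral content. The matrix names (A, C, Q, R) follow the standard LDS convention; the dimensions, noise levels, and random parameters below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# A dynamic texture models a sequence of audio feature vectors y_t as the
# output of a linear dynamical system (LDS):
#   x_{t+1} = A x_t + v_t,  v_t ~ N(0, Q)   (hidden state: temporal dynamics)
#   y_t     = C x_t + w_t,  w_t ~ N(0, R)   (observation: timbral content)

rng = np.random.default_rng(0)
n, m, T = 4, 13, 100  # state dim, feature dim (e.g., MFCC-like), sequence length

# Random stable transition matrix: scale so all eigenvalues lie inside the unit circle.
A = rng.standard_normal((n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))
C = rng.standard_normal((m, n))  # observation (output) matrix
Q = 0.1 * np.eye(n)              # state noise covariance
R = 0.1 * np.eye(m)              # observation noise covariance

def sample_dynamic_texture(A, C, Q, R, T, rng):
    """Sample a length-T observation sequence from the LDS."""
    n, m = A.shape[0], C.shape[0]
    x = rng.multivariate_normal(np.zeros(n), np.eye(n))  # initial state
    Y = np.empty((T, m))
    for t in range(T):
        Y[t] = C @ x + rng.multivariate_normal(np.zeros(m), R)
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    return Y

Y = sample_dynamic_texture(A, C, Q, R, T, rng)
print(Y.shape)  # (100, 13)
```

A DTM is then a mixture of such components, and HEM-DTM clusters the song-level components into a smaller set of tag-level dynamic textures.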
