Modeling Music as a Dynamic Texture

We consider representing a short temporal fragment of musical audio as a dynamic texture, a model that captures both the timbral and the rhythmic qualities of sound, two important aspects of automatic music analysis. The dynamic texture model treats a sequence of audio feature vectors as a sample from a linear dynamical system. We apply this representation to the task of automatic song segmentation. In particular, we cluster audio fragments extracted from a song as samples from a dynamic texture mixture (DTM) model. We show that the DTM model can both accurately cluster coherent segments in music and detect transition boundaries. Moreover, the generative character of the proposed model makes it amenable to a wide range of applications besides segmentation. As examples, we use DTM models of songs to suggest possible improvements in other music information retrieval applications, such as music annotation and similarity.
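To make the phrase "a sample from a linear dynamical system" concrete, the following minimal sketch draws a short sequence of feature vectors from the form commonly used for dynamic textures, x[t+1] = A x[t] + v[t] and y[t] = C x[t] + w[t]. The abstract does not state these equations, so the form is an assumption here, as are the function name `sample_lds`, the isotropic Gaussian noise, and all numeric values, which are hypothetical toy parameters rather than anything learned from audio.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sample_lds(A, C, q_std, r_std, x0, T, seed=0):
    """Draw T observations from x[t+1] = A x[t] + v[t], y[t] = C x[t] + w[t].

    v[t] and w[t] are i.i.d. zero-mean Gaussian noise with isotropic
    standard deviations q_std and r_std (a simplifying assumption).
    """
    rng = random.Random(seed)
    x = list(x0)
    ys = []
    for _ in range(T):
        # Observation: project the hidden state and add observation noise.
        ys.append([c + rng.gauss(0.0, r_std) for c in matvec(C, x)])
        # State update: linear dynamics plus process noise.
        x = [a + rng.gauss(0.0, q_std) for a in matvec(A, x)]
    return ys

# Hypothetical toy parameters: a slowly rotating, slightly damped 2-D
# hidden state (A) observed through a 3-D "feature" vector (C).
A = [[0.99, -0.10], [0.10, 0.99]]
C = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
features = sample_lds(A, C, q_std=0.01, r_std=0.05, x0=[1.0, 0.0], T=50)
```

In this picture, C plays the role of the timbral component (how a hidden state maps to observed features) and A the temporal component (how the state evolves from frame to frame); a mixture model such as the DTM would hold several such parameter sets, one per cluster.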
