On the Use of the Tempogram to Describe Audio Content and Its Application to Music Structural Segmentation

This paper presents a new set of audio features for describing music content based on tempo cues. The tempogram, a mid-level representation of tempo information, is constructed to characterize tempo variation and local pulse in the audio signal. We introduce a collection of novel tempogram-based features inspired by musicological hypotheses about the relation between music structure and the rhythmic components prominent at different metrical levels. The strength of these features is demonstrated on music structural segmentation, an important task in music information retrieval (MIR), using several published popular-music datasets. Results indicate that incorporating tempo information into audio segmentation is a promising new direction.
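As a concrete illustration of the tempogram idea described above, the sketch below computes a simple Fourier tempogram from an onset novelty curve: each windowed segment of the novelty curve is correlated with complex sinusoids at candidate tempi, and the resulting magnitudes form a tempo-versus-time representation. The function name, parameter values, and tempo range here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def fourier_tempogram(novelty, fs_novelty, bpm_range=(30, 300),
                      win_len=512, hop=64):
    """Illustrative Fourier tempogram (not the paper's exact method).

    novelty    : 1-D onset novelty curve
    fs_novelty : sampling rate of the novelty curve in Hz
    Returns (bpms, T) where T[k, m] is the magnitude response for
    candidate tempo bpms[k] in analysis frame m.
    """
    bpms = np.arange(bpm_range[0], bpm_range[1] + 1)   # candidate tempi (BPM)
    freqs = bpms / 60.0                                # corresponding beat rates (Hz)
    window = np.hanning(win_len)
    t = np.arange(win_len) / fs_novelty
    # One complex sinusoid per candidate tempo, pre-multiplied by the window.
    kernels = np.exp(-2j * np.pi * freqs[:, None] * t[None, :]) * window
    n_frames = 1 + max(0, (len(novelty) - win_len) // hop)
    T = np.empty((len(bpms), n_frames))
    for m in range(n_frames):
        seg = novelty[m * hop : m * hop + win_len]
        T[:, m] = np.abs(kernels @ seg)                # tempo salience in frame m
    return bpms, T
```

A cyclic tempogram, as in reference work on mid-level tempo representations, would additionally fold tempi related by powers of two onto a common axis; the plain version above already exposes the metrical-level structure the features build on.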
