Unsupervised Feature Learning for Music Structural Analysis

Music Structural Analysis (MSA) algorithms analyze songs in order to automatically retrieve their large-scale structure. They operate on a feature-based representation of the audio signal (e.g., MFCCs, a chromagram), which is usually hand-designed for this specific application. Designing a proper audio representation for MSA requires assessing which musical properties are relevant for segmentation (e.g., timbre, harmony) and devising signal processing strategies that capture them. Deep learning techniques offer an alternative to this approach, as they can automatically learn an abstract representation of the musical content. In this work we investigate their use for Music Structural Analysis. In particular, we compare the performance of several state-of-the-art algorithms when they operate on a collection of traditional descriptors and when they operate on descriptors extracted with a Deep Belief Network.
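
As a concrete illustration of the pipeline the abstract describes, the sketch below extracts MFCC and chromagram descriptors with librosa and learns a higher-level per-frame representation by greedy layer-wise training of stacked Bernoulli RBMs, the standard pretraining procedure for a Deep Belief Network. This is a minimal sketch, not the paper's implementation: the file name "song.wav", the layer sizes, and the training hyperparameters are illustrative assumptions.

```python
# Sketch of a DBN-style feature-learning front end for MSA.
# Assumptions (not from the paper): input file "song.wav", 13 MFCCs,
# two RBM layers of 50 units; stacked BernoulliRBMs stand in for the
# greedy layer-wise pretraining of a Deep Belief Network.
import numpy as np
import librosa
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

# 1. Hand-designed descriptors: MFCCs and a chromagram, per analysis frame.
y, sr = librosa.load("song.wav")
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape (13, n_frames)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # shape (12, n_frames)
frames = np.vstack([mfcc, chroma]).T                  # shape (n_frames, 25)

# 2. BernoulliRBM expects inputs in [0, 1].
frames = MinMaxScaler().fit_transform(frames)

# 3. Greedy layer-wise training of stacked RBMs (DBN-style pretraining):
#    each layer is trained on the hidden activations of the previous one.
hidden = frames
for n_components in (50, 50):
    rbm = BernoulliRBM(n_components=n_components, learning_rate=0.05,
                       n_iter=20, random_state=0)
    hidden = rbm.fit_transform(hidden)

# 'hidden' now holds the learned per-frame representation that an MSA
# algorithm would consume in place of the raw descriptors.
print(hidden.shape)
```

In a comparison like the one the abstract outlines, the same segmentation algorithms would simply be run twice: once on `frames` (the traditional descriptors) and once on `hidden` (the learned representation).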
