Unsupervised Learning of Deep Features for Music Segmentation

Music segmentation refers to the dual problem of identifying the boundaries between distinct music segments and labeling those segments, e.g., the chorus, verse, and bridge in popular music. The performance of a range of music segmentation algorithms has been shown to depend on the audio features chosen to represent the audio. Some approaches learn feature transformations from music segment annotation data; however, such data is time-consuming or expensive to create, so these approaches are likely limited by the size of their datasets. While annotated music segmentation data is scarce, unannotated music audio is far more plentiful. In the neighboring field of semantic audio, unsupervised deep learning has shown promise in improving performance on the query-by-example and sound classification tasks. In this work, unsupervised training of deep feature embeddings using convolutional neural networks (CNNs) is explored for music segmentation. The proposed techniques exploit only the time proximity of audio features that is implicit in any audio timeline. Employing these embeddings in a classic music segmentation algorithm is shown not only to significantly improve the performance of that algorithm, but also to achieve state-of-the-art performance in unsupervised music segmentation.
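
To make the idea of supervision from time proximity concrete, the following is a minimal sketch rather than the method used in this work. It assumes a triplet hinge loss and a simple sampling rule (positives within a small window of the anchor frame, negatives at least a minimum distance away); the abstract states neither the loss nor the sampling scheme, so the function names, window sizes, and margin below are all hypothetical. The embeddings are stand-in random vectors, where a CNN output would be used in practice.

```python
import numpy as np


def sample_triplets(n_frames, n_triplets, pos_radius, neg_min_dist, rng):
    """Sample (anchor, positive, negative) frame indices from time proximity alone.

    Assumed rule (not taken from the paper): the positive lies within
    pos_radius frames of the anchor, the negative at least neg_min_dist away.
    """
    if n_frames < 2 or pos_radius < 1:
        raise ValueError("need at least two frames and pos_radius >= 1")
    if neg_min_dist > (n_frames - 1) // 2:
        raise ValueError("neg_min_dist too large for this track length")
    triplets = []
    while len(triplets) < n_triplets:
        a = int(rng.integers(0, n_frames))
        lo = max(0, a - pos_radius)
        hi = min(n_frames - 1, a + pos_radius)
        p = int(rng.integers(lo, hi + 1))
        n = int(rng.integers(0, n_frames))
        # Reject draws where the positive is the anchor or the negative is too close.
        if p != a and abs(n - a) >= neg_min_dist:
            triplets.append((a, p, n))
    return np.array(triplets)


def triplet_loss(emb, triplets, margin=0.1):
    """Mean hinge loss on squared Euclidean distances between embeddings."""
    a = emb[triplets[:, 0]]
    p = emb[triplets[:, 1]]
    n = emb[triplets[:, 2]]
    d_pos = np.sum((a - p) ** 2, axis=1)
    d_neg = np.sum((a - n) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Hypothetical usage: 500 frames of a track, 16-dimensional stand-in embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 16))
triplets = sample_triplets(500, 64, pos_radius=8, neg_min_dist=64, rng=rng)
loss = triplet_loss(embeddings, triplets)
```

No segment annotations appear anywhere in this sketch: the only signal is where each frame sits on the timeline, which is what lets such training draw on large collections of unannotated audio.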
