Cover Detection Using Dominant Melody Embeddings

Automatic cover detection, the task of finding in an audio database all the covers of one or several query tracks, has long been seen as a challenging theoretical problem in the MIR community and as an acute practical problem for authors' and composers' societies. Early algorithms proposed for this task have proven accurate on small datasets, but do not scale to modern real-life audio corpora. Conversely, faster approaches designed to process thousands of pairwise comparisons have achieved lower accuracy, making them unsuitable for practical use. In this work, we propose a neural network architecture that is trained to represent each track as a single embedding vector. The computational burden is therefore shifted to the embedding extraction, which can be performed offline and its results stored, while each pairwise comparison reduces to a simple Euclidean distance computation. We further propose to extract each track's embedding from its dominant melody representation, which is obtained by another neural network trained for that task. We then show that this architecture improves state-of-the-art accuracy on both small and large datasets, and that it can query databases of thousands of tracks in a few seconds.
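
The retrieval step described above can be illustrated with a minimal sketch. It assumes the embeddings have already been extracted offline and stored; the function name `rank_covers`, the track identifiers, and the three-dimensional toy vectors are hypothetical and chosen only for illustration. This is not the authors' implementation, only an instance of ranking stored embeddings by Euclidean distance to a query embedding.

```python
import math


def rank_covers(query, database):
    """Rank stored tracks by Euclidean distance to the query embedding.

    `query` is a sequence of floats; `database` maps a track id to an
    embedding of the same length. Both are assumed to be precomputed.
    """
    scored = [(track_id, math.dist(query, emb)) for track_id, emb in database.items()]
    # Smaller distance means the track is a more likely cover candidate.
    return sorted(scored, key=lambda pair: pair[1])


# Toy embeddings (hypothetical values, not produced by any trained network).
database = {
    "track_a": [0.0, 1.0, 0.0],
    "track_b": [0.9, 0.1, 0.0],
    "track_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank_covers(query, database)
for track_id, distance in ranking:
    print(f"{track_id}: {distance:.3f}")
```

Because each comparison is a single distance computation on fixed-size vectors, the query cost grows linearly with the number of stored tracks, which is what makes the offline extraction worthwhile.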
