Efficient Music Identification Approach Based on Local Spectrogram Image Descriptors

The spread of large music collections has created a need for algorithms that enable fast song retrieval from query audio excerpts. This is the case for online media sharing platforms that may want to detect copyrighted material. In this paper, we start from a previously proposed state-of-the-art algorithm for robust music matching, which compares spectrograms by leveraging computer vision concepts. We show that this algorithm can be further optimized by exploiting more recent image processing techniques and by restricting the analysis to limited temporal windows, while still achieving accurate matching performance. The proposed solution is validated on a dataset of 800 songs, reporting an 80% decrease in computational complexity at the cost of an accuracy loss of only about 1%.
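
The general idea of matching local descriptors over a limited temporal window can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each spectrogram keypoint has already been reduced to a binary descriptor (here a toy 8-bit integer paired with a frame index), that descriptors are compared by Hamming distance, and that a song is identified by counting close matches. The function names, the `max_dist` threshold, and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple

# A descriptor is (frame_index, bits); bits is a toy 8-bit binary code.
# Real binary descriptors are much longer; this layout is an assumption.
Descriptor = Tuple[int, int]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def crop_window(descs: List[Descriptor], start: int, length: int) -> List[Descriptor]:
    """Keep only descriptors whose frame index lies in [start, start + length)."""
    return [(t, d) for t, d in descs if start <= t < start + length]


def match_score(query: List[Descriptor], reference: List[Descriptor], max_dist: int = 2) -> int:
    """Count query descriptors whose nearest reference code is within max_dist bits."""
    if not reference:
        return 0
    score = 0
    for _, q in query:
        best = min(hamming(q, r) for _, r in reference)
        if best <= max_dist:
            score += 1
    return score


def identify(query: List[Descriptor], database: Dict[str, List[Descriptor]],
             start: int, length: int) -> str:
    """Return the database entry with the highest score for the windowed query."""
    windowed = crop_window(query, start, length)
    scores = {name: match_score(windowed, ref) for name, ref in database.items()}
    return max(scores, key=scores.get)


# Toy data (hypothetical): two reference songs and a noisy excerpt of song_a.
database = {
    "song_a": [(0, 0b10110010), (1, 0b01101100), (2, 0b11110000), (3, 0b00001111)],
    "song_b": [(0, 0b00000001), (1, 0b11111110), (2, 0b01010101), (3, 0b10101010)],
}
query = [(0, 0b10110011), (1, 0b01101100), (2, 0b11110100), (3, 0b00001111)]

# Only the first two frames of the query are analyzed.
print(identify(query, database, start=0, length=2))
```

In this sketch, shortening the window reduces the number of query descriptors that must be compared against every reference, which is the kind of saving the abstract refers to. The actual descriptors, window lengths, and matching rule used in the paper may differ.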
