Disambiguating Music Artists at Scale with Audio Metric Learning

We address the problem of disambiguating large-scale catalogs by defining an unknown-artist clustering task. We explore the use of metric learning techniques to learn artist embeddings directly from audio and, using a dedicated dataset of homonym artists, we compare our method with a recent approach that learns similar embeddings using artist classifiers. While both systems can disambiguate unknown artists relying exclusively on audio, we show that our system is more suitable when enough audio data is available for each artist in the training dataset. We also propose a new negative sampling method for metric learning that takes advantage of side information, such as music genre, during the learning phase and shows promising results for the artist clustering task.
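To make the idea of metric learning with side-information-aware negative sampling concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the method evaluated in this work: the abstract does not specify the loss or the sampling rule, so the triplet hinge loss, the function names `triplet_loss` and `sample_negative`, and the `same_genre_prob` parameter are all hypothetical choices. The sketch biases negatives toward tracks by a different artist in the same genre, on the assumption that such negatives are harder to separate.

```python
import random


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances (a common choice,
    assumed here for illustration)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


def sample_negative(anchor_idx, artists, genres, rng, same_genre_prob=0.5):
    """Return the index of a track by a different artist than the anchor.

    With probability `same_genre_prob` (a hypothetical knob), the negative is
    drawn from tracks sharing the anchor's genre, when any exist; otherwise it
    is drawn uniformly from all tracks by other artists.
    """
    others = [i for i, a in enumerate(artists) if a != artists[anchor_idx]]
    if not others:
        raise ValueError("no track by a different artist is available")
    same_genre = [i for i in others if genres[i] == genres[anchor_idx]]
    if same_genre and rng.random() < same_genre_prob:
        return rng.choice(same_genre)
    return rng.choice(others)


# Toy usage: four tracks, three artists, two genres.
artists = ["a", "a", "b", "c"]
genres = ["rock", "rock", "rock", "jazz"]
rng = random.Random(0)
neg_idx = sample_negative(0, artists, genres, rng, same_genre_prob=1.0)
```

In the toy usage, setting `same_genre_prob=1.0` forces the negative for track 0 to come from the only other rock artist. In a real training loop the embeddings passed to `triplet_loss` would come from an audio model, which is outside the scope of this sketch.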
