I-Vectors for Timbre-Based Music Similarity and Music Artist Classification

In this paper, we present a novel approach to extracting song-level descriptors built from frame-level timbral features such as Mel-frequency cepstral coefficients (MFCCs). These descriptors are called identity vectors, or i-vectors, and are the result of a factor analysis procedure applied to the frame-level features. The i-vectors provide a low-dimensional, fixed-length representation of each song and can be used in both supervised and unsupervised settings. First, we use the i-vectors for unsupervised music similarity estimation, where we calculate the distance between i-vectors in order to predict the genre of songs. Second, for a supervised artist classification task, we report performance measures for multiple classifiers trained on the i-vectors. We evaluate our method on standard datasets for each task and compare the results with the state of the art. Using timbral information alone, our method already reaches the performance of the state of the art in music similarity, which relies on additional information such as rhythm. In artist classification using timbre descriptors, our method outperforms the state of the art.
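
The unsupervised use described above, predicting a song's genre from the distances between i-vectors, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the distance measure or the decision rule, so cosine distance and a k-nearest-neighbour majority vote are assumptions here, and the three-dimensional vectors, genre labels, and function names are made up for the example.

```python
from collections import Counter

import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two vectors (assumed measure, not from the paper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict_genre(query, ivectors, genres, k=1):
    """Predict a genre by majority vote among the k nearest i-vectors."""
    distances = np.array([cosine_distance(query, v) for v in ivectors])
    nearest = np.argsort(distances)[:k]  # indices of the k closest songs
    votes = Counter(genres[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy stand-ins for already-extracted i-vectors (hypothetical values).
ivectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.8],
    [0.1, 0.9, 1.0],
])
genres = ["rock", "rock", "jazz", "jazz"]

query = np.array([0.95, 0.15, 0.05])
predicted = predict_genre(query, ivectors, genres, k=3)
print(predicted)
```

In this sketch no model is trained on genre labels; the labels are only looked up for the nearest neighbours, which is what makes the similarity estimation itself unsupervised.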
