Modeling nuisance variabilities with factor analysis for GMM-based audio pattern classification

Abstract: Audio pattern classification represents a particular statistical classification task and includes, for example, speaker recognition, language recognition, emotion recognition, speech recognition and, recently, video genre classification. The feature being used in all these tasks is generally based on a short-term cepstral representation. The cepstral vectors contain at the same time useful information and nuisance variability, which are difficult to separate in this domain. Recently, in the context of GMM-based recognizers, a novel approach using a Factor Analysis (FA) paradigm has been proposed for decomposing the target model into a useful information component and a session variability component. This approach is called Joint Factor Analysis (JFA), since it models jointly the nuisance variability and the useful information, using the FA statistical method. The JFA approach has even been combined with Support Vector Machines, known for their discriminative power. In this article, we successfully apply this paradigm to three automatic audio processing applications: speaker verification, language recognition and video genre classification. This is done by applying the same process and using the same free software toolkit. We will show that this approach allows for a relative error reduction of over 50% in all the aforementioned audio processing tasks.

[1]  Jean-François Bonastre,et al.  ALIZE, a free toolkit for speaker recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[2]  Lukás Burget,et al.  Discriminative training and channel compensation for acoustic language recognition , 2008, INTERSPEECH.

[3]  Lukás Burget,et al.  Comparison of scoring methods used in speaker recognition with Joint Factor Analysis , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Driss Matrouf,et al.  A straightforward and efficient implementation of the factor analysis model for speaker verification , 2007, INTERSPEECH.

[5]  Roach Matthew Classification of Non-edited Broadcast Video Using Holistic Low-level Features , 2002 .

[6]  Svetha Venkatesh,et al.  Horror film genre typing and scene labeling via audio analysis , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[7]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[8]  Sadaoki Furui,et al.  International Speech Communication Association , 2006 .

[9]  G.R. Doddington,et al.  Speaker recognition—Identifying people by their voices , 1985, Proceedings of the IEEE.

[10]  Kai Feng,et al.  Subspace Gaussian Mixture Models for speech recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  R.S. Jasinschi,et al.  Automatic TV program genre classification based on audio patterns , 2001, Proceedings 27th EUROMICRO Conference. 2001: A Net Odyssey.

[12]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[13]  Lukás Burget,et al.  Brno University of Technology system for Interspeech 2009 emotion challenge , 2009, INTERSPEECH.

[14]  Douglas E. Sturim,et al.  SVM Based Speaker Verification using a GMM Supervector Kernel and NAP Variability Compensation , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[15]  Patrick Kenny,et al.  A Study of Interspeaker Variability in Speaker Verification , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[16]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[17]  Diane J. Cook,et al.  Automatic Video Classification: A Survey of the Literature , 2008, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[18]  Weiyu Zhu,et al.  Automatic news video segmentation and categorization based on closed-captioned text , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[19]  Pierre Dumouchel,et al.  Cepstral and long-term features for emotion recognition , 2009, INTERSPEECH.

[20]  Lukás Burget,et al.  Discriminative acoustic language recognition via channel-compensated GMM statistics , 2009, INTERSPEECH.

[21]  Douglas A. Reynolds,et al.  A Tutorial on Text-Independent Speaker Verification , 2004, EURASIP J. Adv. Signal Process..

[22]  John S. D. Mason,et al.  Classification of video genre using audio , 2001, INTERSPEECH.

[23]  Lukás Burget,et al.  Brno University of Technology System for NIST 2005 Language Recognition Evaluation , 2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop.

[24]  B.S. Atal,et al.  Automatic recognition of speakers from their voices , 1976, Proceedings of the IEEE.

[25]  Patrick Kenny,et al.  Experiments in speaker verification using factor analysis likelihood ratios , 2004, Odyssey.

[26]  Lukás Burget,et al.  Investigation into variants of joint factor analysis for speaker recognition , 2009, INTERSPEECH.

[27]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.

[28]  Roger K. Moore Computer Speech and Language , 1986 .

[29]  Pietro Laface,et al.  Compensation of Nuisance Factors for Speaker and Language Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[30]  Sridha Sridharan,et al.  Explicit modelling of session variability for speaker verification , 2008, Comput. Speech Lang..

[31]  Sadaoki Furui,et al.  An Overview of Speaker Recognition Technology , 1996 .

[32]  Yongmin Li,et al.  Video classification using spatial-temporal features and PCA , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[33]  Driss Matrouf,et al.  Factor analysis and SVM for language recognition , 2009, INTERSPEECH.

[34]  Douglas A. Reynolds,et al.  Approaches to language identification using Gaussian mixture models and shifted delta cepstral features , 2002, INTERSPEECH.

[35]  Shih-Fu Chang,et al.  Survey of compressed-domain features used in audio-visual indexing and analysis , 2003, J. Vis. Commun. Image Represent..

[36]  James H. Elder,et al.  Probabilistic Linear Discriminant Analysis for Inferences About Identity , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[37]  Patrick Kenny,et al.  Factor analysis simplified [speaker verification applications] , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[38]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[39]  Marc A. Zissman,et al.  Comparison of : Four Approaches to Automatic Language Identification of Telephone Speech , 2004 .

[40]  William M. Campbell,et al.  Language recognition with support vector machines , 2004, Odyssey.

[41]  Sridha Sridharan,et al.  Modelling session variability in text-independent speaker verification , 2005, INTERSPEECH.