A unified approach for audio characterization and its application to speaker recognition

Systems designed to solve speech processing tasks such as speech or speaker recognition, language identification, or emotion detection are known to be affected by the recording conditions of the acoustic signal, including the channel, background noise, and reverberation. Knowledge of the nuisance characteristics present in the signal can be used to improve system performance. In some cases, the nature of these nuisance characteristics is known a priori, but in most practical cases it is not. Most approaches for automatically detecting the characteristics of a signal are designed for one specific type of effect: noise, reverberation, language, type of channel, and so on. We propose a unified method for detecting the audio characteristics of a signal, based on iVectors. We show results for the detector itself and for its use as metadata during calibration of a state-of-the-art speaker recognition system based on iVectors extracted from Mel frequency cepstral coefficients. Results show relative gains in equal error rate of up to 15% across a variety of recording conditions.
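The evaluation metric cited above, equal error rate (EER), is the operating point at which the miss rate equals the false-alarm rate. As a minimal illustration (not the paper's implementation), EER can be estimated from target and impostor score sets by sweeping the decision threshold:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate EER by sweeping the threshold over all observed scores.

    At each candidate threshold, the false-negative rate (targets rejected)
    and false-positive rate (impostors accepted) are computed; EER is taken
    at the threshold where the two rates are closest.
    """
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    eer, best_gap = None, np.inf
    for t in thresholds:
        fnr = np.mean(target_scores < t)     # misses
        fpr = np.mean(impostor_scores >= t)  # false alarms
        gap = abs(fnr - fpr)
        if gap < best_gap:
            best_gap, eer = gap, (fnr + fpr) / 2
    return eer
```

A 15% relative gain at, say, a 4.0% EER baseline would correspond to an absolute EER of about 3.4%.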
