Speaker verification in sensor and acoustic environment mismatch conditions

Our initial speaker verification study on the impact of mismatch between training and test conditions found that mismatch in sensor and acoustic environment degrades performance significantly more than other mismatches such as language and style (Haris et al. in Int. J. Speech Technol., 2012). In this work we present a method to suppress the mismatch between training and test speech, specifically that due to sensor and acoustic environment. The method is based on identifying and emphasizing vowel-like regions (VLRs), which are more speaker specific and less affected by mismatch than other speech regions. VLRs are separated from the speech regions (detected using voice activity detection (VAD)) using the VLR onset point (VLROP) and are processed independently during training and testing of the speaker verification system. Finally, the scores are combined, with more weight given to the score generated by VLRs, as these are relatively more speaker specific and less mismatch affected. Speaker verification studies are conducted using mel-frequency cepstral coefficients (MFCCs) as feature vectors. Speaker modeling is done using the Gaussian mixture model-universal background model (GMM-UBM) and the state-of-the-art i-vector based approaches. The experimental results show that, for both systems, the proposed approach provides consistent performance improvement over the conventional approach, with and without different channel compensation techniques. For instance, on the IITG-MV Phase-II dataset with headphone-trained and voice-recorder test speech, the proposed approach provides a relative improvement of 25.08% in equal error rate (EER) for the i-vector based speaker verification system with linear discriminant analysis (LDA) and within-class covariance normalization (WCCN), compared to the conventional approach.
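The score combination and the relative EER figure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the fusion rule or the weights, so a simple weighted sum is assumed here, and the names `fuse_scores`, `relative_improvement`, and the default `vlr_weight` are hypothetical.

```python
def fuse_scores(vlr_score, non_vlr_score, vlr_weight=0.75):
    """Combine two verification scores with a weighted sum.

    Assumption: a linear fusion rule. vlr_weight is a hypothetical value;
    it only needs to exceed 0.5 so that the VLR score dominates.
    """
    if not 0.5 < vlr_weight <= 1.0:
        raise ValueError("vlr_weight must be in (0.5, 1.0] so VLRs get more weight")
    return vlr_weight * vlr_score + (1.0 - vlr_weight) * non_vlr_score


def relative_improvement(baseline_eer, proposed_eer):
    """Relative EER reduction, in percent, of a proposed system over a baseline."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - proposed_eer) / baseline_eer


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the paper.
    print(fuse_scores(2.0, 1.0))             # 0.75 * 2.0 + 0.25 * 1.0
    print(relative_improvement(20.0, 15.0))  # 100 * (20 - 15) / 20
```

With this definition, a relative improvement of 25.08% means the proposed system's EER is about three quarters of the conventional system's EER, regardless of the absolute EER values.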

[1]  Bayya Yegnanarayana,et al.  Enhancement of reverberant speech using LP residual signal , 2000, IEEE Trans. Speech Audio Process..

[2]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[3]  P. Krishnamoorthy,et al.  Reverberant Speech Enhancement by Temporal and Spectral Processing , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Hynek Hermansky,et al.  Speech enhancement using linear prediction residual , 1999, Speech Commun..

[5]  S. R. Mahadeva Prasanna,et al.  Multivariability speaker recognition database in Indian scenario , 2012, Int. J. Speech Technol..

[6]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[7]  Bayya Yegnanarayana,et al.  Epoch Extraction From Speech Signals , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  Bayya Yegnanarayana,et al.  Characterization of Glottal Activity From Speech Signals , 2009, IEEE Signal Processing Letters.

[9]  Mike Brookes,et al.  Estimation of Glottal Closure Instants in Voiced Speech Using the DYPSA Algorithm , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Roland Auckenthaler,et al.  Score Normalization for Text-Independent Speaker Verification Systems , 2000, Digit. Signal Process..

[11]  Hynek Hermansky,et al.  RASTA-PLP speech analysis technique , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[12]  Andrew Varga,et al.  Assessment for automatic speech recognition II , 1993 .

[13]  Ramesh A. Gopinath,et al.  Short-time Gaussianization for robust speaker verification , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[14]  Hong Kook Kim,et al.  Cepstrum-domain acoustic feature compensation based on decomposition of speech and noise for ASR in noisy environments , 2001, IEEE Trans. Speech Audio Process..

[15]  S. R. Mahadeva Prasanna,et al.  Detection of vowel onset point events using excitation information , 2005, INTERSPEECH.

[16]  S. R. M. Prasanna,et al.  Significance of Vowel-Like Regions for Speaker Verification Under Degraded Conditions , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Sridha Sridharan,et al.  Feature warping for robust speaker verification , 2001, Odyssey.

[18]  Haizhou Li,et al.  An overview of text-independent speaker recognition: From features to supervectors , 2010, Speech Commun..

[19]  William M. Campbell,et al.  Advances in channel compensation for SVM speaker recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[20]  Andreas Stolcke,et al.  Within-class covariance normalization for SVM-based speaker recognition , 2006, INTERSPEECH.

[21]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  Douglas A. Reynolds,et al.  Channel robust speaker verification via feature mapping , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[23]  Jonathan G. Fiscus,et al.  DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM , 1993, NIST.

[24]  J.P. Eatock,et al.  A quantitative assessment of the relative speaker discriminating properties of phonemes , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[25]  B. Yegnanarayana,et al.  Epoch extraction from linear prediction residual for identification of closed glottis interval , 1979 .

[26]  S. R. Mahadeva Prasanna,et al.  Enhancement of noisy speech by temporal and spectral processing , 2011, Speech Commun..

[27]  S. R. Mahadeva Prasanna,et al.  Vowel Onset Point Detection Using Source, Spectral Peaks, and Modulation Spectrum Energies , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[28]  S. R. Mahadeva Prasanna,et al.  Speaker verification under degraded condition: a perceptual study , 2011, Int. J. Speech Technol..

[29]  S R M Prasanna,et al.  Multi-variability speech database for robust speaker recognition , 2011, 2011 National Conference on Communications (NCC).