Speaker Identification for Whispered Speech Using a Training Feature Transformation from Neutral to Whisper

A number of recent studies in speaker recognition have focused on robustness to microphone and channel mismatch (e.g., the NIST SRE evaluations). However, changes in vocal effort, especially whispered speech, present significant challenges to maintaining system performance. Due to the mismatched spectral structure resulting from the different production mechanisms, the performance of speaker identification systems trained on neutral speech degrades significantly when tested on whispered speech. This study considers a feature transformation method in the training phase that leads to a more robust speaker model for speaker ID with whispered speech. In the proposed system, a Speech Mode Independent (SMI) Universal Background Model (UBM) is built from real neutral features together with pseudo-whisper features generated either with a Vector Taylor Series (VTS) approach or via Constrained Maximum Likelihood Linear Regression (CMLLR) model adaptation. Text-independent closed-set speaker ID results on the UT-VocalEffort II corpus show an accuracy of 88.87% with the proposed method, a 46.26% relative reduction in error compared with the 79.29% accuracy of the baseline system. This result confirms a viable approach to improving speaker ID performance under mismatched neutral/whisper conditions.

Index Terms: whispered speech, speaker identification
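The core idea above can be sketched as follows. This is a minimal illustrative example, not the paper's exact pipeline: the CMLLR-style neutral-to-whisper mapping is stood in for by a fixed affine transform on toy features, the feature data are random stand-ins for cepstral frames, and the SMI-UBM is a small diagonal-covariance GMM fit on the pooled neutral plus pseudo-whisper features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy stand-in for neutral-speech cepstral features (frames x dims).
neutral_feats = rng.normal(size=(500, 13))

# Hypothetical CMLLR-style affine transform x' = A x + b; in the paper this
# would be estimated by model adaptation (or replaced by a VTS mapping),
# here it is fixed purely for illustration.
A = np.eye(13) * 0.9
b = np.full(13, 0.5)
pseudo_whisper_feats = neutral_feats @ A.T + b

# Pool real neutral and pseudo-whisper features, then fit the SMI-UBM.
pooled = np.vstack([neutral_feats, pseudo_whisper_feats])
smi_ubm = GaussianMixture(
    n_components=8, covariance_type="diag", reg_covar=1e-3, random_state=0
).fit(pooled)

# Score a test utterance as the average per-frame log-likelihood under the UBM;
# in a full system, MAP-adapted speaker models would be scored the same way.
test_utt = rng.normal(size=(100, 13))
avg_loglik = smi_ubm.score(test_utt)
```

Because the UBM sees both speech modes during training, speaker models adapted from it are less biased toward neutral speech, which is the motivation for pooling the pseudo-whisper features rather than training on neutral data alone.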
