Model and feature based compensation for whispered speech recognition

This study proposes model- and feature-based strategies for automatic whispered speech recognition. Our goal is to compensate for the mismatch between neutral-trained recognizer models and the parameters of whispered speech. We propose generating pseudo-whisper samples from neutral speech for efficient acoustic model adaptation. The scheme is based on the popular Vector Taylor Series (VTS) algorithm. First, a ‘background’ model is trained on a small amount of whispered data to capture a rough estimate of the target whispered speech characteristics. Second, this background model is used within the VTS strategy to estimate transformations for broad phone classes (consonants and vowels) for each individual neutral utterance, and these transformations are applied to shift the utterance towards whisper. Finally, the resulting pseudo-whisper samples are used to adapt the neutral recognizer models towards whisper. This approach is evaluated together with Vocal Tract Length Normalization (VTLN) and Shift frequency transforms and is shown to greatly benefit recognition performance compared to a traditional whisper-adaptation approach. The absolute WER is reduced from 17.3% to 8.4% in the closed-speakers whisper scenario and from 27.7% to 17.5% in the open-speakers scenario.

Index Terms: whispered speech recognition, Vector Taylor Series, vocal tract length normalization
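To put the reported WER figures in perspective, the short sketch below computes the absolute and relative reductions for both scenarios. Only the four WER values come from the abstract; the function name `wer_reduction` and the dictionary layout are illustrative assumptions, not part of the original work.

```python
# Minimal sketch: absolute vs. relative WER reduction for the figures
# quoted in the abstract. The helper name is hypothetical.

def wer_reduction(baseline, improved):
    """Return (absolute reduction in percentage points, relative reduction in percent)."""
    absolute = baseline - improved
    relative = 100.0 * absolute / baseline
    return absolute, relative

# (baseline WER %, WER % after the proposed compensation), as reported in the abstract
scenarios = {
    "closed speakers": (17.3, 8.4),
    "open speakers": (27.7, 17.5),
}

for name, (baseline, improved) in scenarios.items():
    absolute, relative = wer_reduction(baseline, improved)
    print(f"{name}: {absolute:.1f} points absolute, {relative:.1f}% relative")
# closed speakers: 8.9 points absolute, 51.4% relative
# open speakers: 10.2 points absolute, 36.8% relative
```

In relative terms, the closed-speakers error is roughly halved, while the open-speakers error drops by a little over a third.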
