UT-Vocal Effort II: Analysis and constrained-lexicon recognition of whispered speech

This study analyzes the acoustic variations that whispering introduces into speech and proposes several strategies to make automatic speech recognition with neutral-trained acoustic models more robust to whispered speech. The analysis part examines differences between neutral and whispered speech in the UT-Vocal Effort II corpus in terms of energy, spectral slope, and the distributions of formant center frequencies and bandwidths, measured separately in silent, voiced, and unvoiced segments of the speech signal. The recognition part proposes front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion with alternative pronunciations. The proposed neutral-trained system, which employs a redistributed filter bank and reduced features, provides a 7.7% absolute WER reduction over the baseline system trained on neutral speech and a 1.3% reduction over a baseline system with whisper-adapted acoustic models.
