Generative Modeling of Pseudo-Whisper for Robust Whispered Speech Recognition

Whisper is a common means of communication used to avoid disturbing others or to exchange private information. As a vocal style, whisper would be an ideal candidate for human-handheld/computer interactions in open-office or public-area scenarios. Unfortunately, current speech technology is predominantly focused on modal (neutral) speech and breaks down when exposed to whisper. One of the major barriers to successful whisper recognition engines is the lack of large transcribed whispered speech corpora. This study introduces two strategies that require only a small amount of untranscribed whisper samples to produce large amounts of whisper-like (pseudo-whisper) utterances from easily accessible modal speech recordings. Once generated, the pseudo-whisper samples are used to adapt the modal acoustic models of a speech recognizer toward whisper. The first strategy is based on Vector Taylor Series (VTS): a whisper "background" model is first trained to capture a rough estimate of global whisper characteristics from a small amount of actual whisper data. That background model is then used within VTS to establish transformations, specific to broad phone classes (unvoiced/voiced phones), from each input modal utterance to its pseudo-whispered version. The second strategy generates pseudo-whisper samples by means of denoising autoencoders (DAE). Two generative models are investigated: one produces pseudo-whisper cepstral features on a frame-by-frame basis, while the other generates pseudo-whisper statistics for whole phone segments. It is shown that the word error rates of a TIMIT-trained speech recognizer on a whisper recognition task with a constrained lexicon are considerably reduced after adapting the acoustic model toward the VTS or DAE pseudo-whisper samples, compared to model adaptation on the available small whisper set.
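To make the frame-by-frame idea concrete, the following is a minimal sketch of a feed-forward network that maps modal cepstral frames to whisper-like frames. It is an illustration under stated assumptions, not the authors' architecture or training setup: the data are synthetic random vectors standing in for paired modal/whisper frames, and the 13-dimensional features, single tanh hidden layer, layer sizes, learning rate, and plain full-batch gradient descent are all choices made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for cepstral frames (assumption: real features would be
# extracted from modal and whisper recordings, not drawn at random).
n_frames, dim, hidden = 512, 13, 32
modal = rng.standard_normal((n_frames, dim))
true_map = 0.3 * rng.standard_normal((dim, dim))  # hypothetical modal-to-whisper relation
whisper = np.tanh(modal @ true_map) + 0.05 * rng.standard_normal((n_frames, dim))

# One-hidden-layer network: modal frame in, pseudo-whisper frame out.
W1 = 0.1 * rng.standard_normal((dim, hidden))
b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((hidden, dim))
b2 = np.zeros(dim)


def forward(x):
    """Return hidden activations and the predicted whisper-like frames."""
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2


def mse(pred, target):
    """Mean squared error over all frames and coefficients."""
    return float(np.mean((pred - target) ** 2))


initial_loss = mse(forward(modal)[1], whisper)

learning_rate = 0.5
for _ in range(200):
    h, pred = forward(modal)
    # Gradient of the MSE with respect to the network output.
    grad_pred = 2.0 * (pred - whisper) / pred.size
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    # Backpropagate through the tanh hidden layer.
    grad_h = (grad_pred @ W2.T) * (1.0 - h ** 2)
    grad_W1 = modal.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

# Apply the trained mapping to modal frames to obtain pseudo-whisper frames.
pseudo_whisper = forward(modal)[1]
final_loss = mse(pseudo_whisper, whisper)
print(f"MSE before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In a pipeline of the kind described above, frames like `pseudo_whisper` would be generated from a large modal corpus and then used as adaptation data for the recognizer's acoustic model; that adaptation step is not shown here.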
