Whispered Speech Recognition Using Deep Denoising Autoencoder and Inverse Filtering

Because the acoustic characteristics of neutral and whispered speech differ profoundly, the performance of traditional automatic speech recognition (ASR) systems trained on neutral speech degrades significantly when they are applied to whisper. To analyze this mismatched train/test situation in depth and to develop an efficient method for whisper recognition, this study first analyzes the acoustic characteristics of whispered speech, then addresses the problems of whispered speech recognition in mismatched conditions, and finally proposes new robust cepstral features and a preprocessing approach based on a deep denoising autoencoder (DDAE) that together enhance whisper recognition. The experimental results confirm that Teager-energy-based cepstral features, especially TECCs, are more robust and better whisper descriptors than traditional Mel-frequency cepstral coefficients (MFCC). Further detailed analysis of cepstral distances, distributions of cepstral coefficients, and confusion matrices, together with experiments with inverse filtering, proves that voicing in speech stimuli is the main cause of word misclassification in mismatched train/test scenarios. The new framework based on DDAE and TECC features significantly improves whisper recognition accuracy and outperforms the traditional MFCC and GMM-HMM (Gaussian mixture density / hidden Markov model) baseline, yielding an absolute 31% improvement in whisper recognition accuracy. The achieved word recognition rate in the neutral/whisper scenario is 92.81%.
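The Teager-energy-based features mentioned above build on the discrete Teager-Kaiser energy operator, psi[n] = x[n]^2 - x[n-1] * x[n+1]. The following is a minimal sketch of that operator only, not the authors' TECC extraction pipeline, which this abstract does not describe. The function name `teager_energy` and the test-tone parameters are illustrative assumptions.

```python
import math


def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample, so the output is two samples
    shorter than the input.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Illustrative check (assumed parameters): for a pure tone A*cos(omega*n),
# the operator yields the constant value A^2 * sin(omega)^2.
amplitude, omega = 0.5, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(200)]
psi = teager_energy(tone)
expected = (amplitude * math.sin(omega)) ** 2
```

For a pure tone the output depends on both amplitude and frequency, which is the property that distinguishes Teager energy from the ordinary squared-amplitude energy.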
