Whispered speech recognition using deep denoising autoencoder

Abstract: Recently, Deep Denoising Autoencoders (DDAE) have shown state-of-the-art performance on various machine learning tasks. In this paper, the authors extend this approach to whispered speech recognition, one of the most challenging problems in Automatic Speech Recognition (ASR). Because the acoustic characteristics of neutral and whispered speech differ profoundly, the performance of traditional ASR systems trained on neutral speech degrades significantly when they are tested on whisper. The proposed deep-learning system alleviates this mismatch between training and testing by applying a DDAE to generate whisper-robust cepstral features. The system was compared, in terms of word recognition accuracy, with a conventional Hidden Markov Model (HMM) speech recognizer on an isolated word recognition task using a real database of whispered speech (WhiSpe). Three types of cepstral coefficients were used in the experiments: MFCC (Mel-Frequency Cepstral Coefficients), TECC (Teager-Energy Cepstral Coefficients) and TEMFCC (Teager-based Mel-Frequency Cepstral Coefficients). The experimental results show that the proposed system significantly improves whisper recognition accuracy and outperforms the traditional HMM-MFCC baseline, with an absolute improvement of 31% in whisper recognition accuracy. The highest word recognition rate for whispered speech, 92.81%, was achieved with the TECC feature.
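
Two of the three feature types named in the abstract (TECC and TEMFCC) are built on Teager energy. As a rough illustration only, the sketch below implements the standard discrete Teager-Kaiser energy operator, psi[n] = x[n]^2 - x[n-1]*x[n+1]. The abstract does not describe the authors' feature extraction pipeline, so the function name `teager_energy`, the test tone, and its parameters `A` and `omega` are assumptions made for this example; it is not the paper's TECC computation.

```python
import math


def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample, so the output is two samples
    shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Hypothetical test signal (not from the paper): a pure tone A*cos(omega*n).
A, omega = 2.0, 0.3
signal = [A * math.cos(omega * n) for n in range(50)]

energy = teager_energy(signal)

# For a pure tone the operator is constant and equals A^2 * sin^2(omega),
# so it tracks both the amplitude and the frequency of the tone.
expected = A * A * math.sin(omega) ** 2
```

For a pure tone the output is the same at every sample, which makes the operator easy to sanity-check before applying it to band-filtered speech.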
