Significance of parametric spectral ratio methods in detection and recognition of whispered speech

This article describes the significance of a new parametric spectral ratio method for detecting whispered speech segments within normally phonated speech. Adaptation methods based on maximum likelihood linear regression (MLLR) are then used to realize a speech recognition system that operates under mismatched train-test conditions. The proposed method computes a ratio spectrum from the linear prediction (LP) and minimum variance distortionless response (MVDR) spectral estimates. The smoothed ratio spectrum is then used to detect whispered segments within neutral speech. The proposed LP-MVDR ratio method is robust across different SNRs, as indicated by whisper diarization experiments conducted on the CHAINS corpus and a cell phone whispered speech corpus, and it performs better than conventional whisper detection methods. To integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, MLLR-based adaptation is used: the hidden Markov models trained on neutral speech are adapted to whispered speech data in the regions that the ratio method detects as whispered. The performance of this approach is first evaluated on whispered speech data from the CHAINS corpus. A second set of experiments is conducted on the cell phone corpus of whispered speech, which was collected using a setup that is used commercially for handling public transactions. The proposed whispered speech recognition system performs better than several conventional methods. The results indicate that a whispered speech recognition system for cell phone based transactions is feasible.
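As an illustration of the ratio spectrum described above, the following is a minimal sketch, not the authors' implementation. It estimates the LP and MVDR spectra of a single frame from its autocorrelation and divides one by the other. The abstract does not specify the direction of the ratio (LP over MVDR is assumed here), the model order, the frame windowing, or the smoothing; all of these, and every function name, are hypothetical choices made for the example.

```python
import numpy as np

def autocorr(frame, order):
    """Biased autocorrelation r[0..order] of a Hamming-windowed frame."""
    x = frame * np.hamming(len(frame))
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)]) / n

def toeplitz_from(r):
    """Symmetric Toeplitz matrix whose first column is r."""
    idx = np.arange(len(r))
    return r[np.abs(idx[:, None] - idx[None, :])]

def lp_spectrum(r, omega):
    """LP (all-pole) spectrum: prediction error power / |A(e^{jw})|^2."""
    order = len(r) - 1
    # Normal equations of the autocorrelation method: R a = -r[1..order]
    a = np.linalg.solve(toeplitz_from(r[:order]), -r[1:])
    err = r[0] + np.dot(a, r[1:])
    basis = np.exp(-1j * np.outer(omega, np.arange(order + 1)))
    A = basis @ np.concatenate(([1.0], a))
    return err / np.abs(A) ** 2

def mvdr_spectrum(r, omega):
    """MVDR spectrum: 1 / (v^H R^{-1} v) with steering vector v(w)."""
    order = len(r) - 1
    R_inv = np.linalg.inv(toeplitz_from(r))
    V = np.exp(1j * np.outer(omega, np.arange(order + 1)))
    quad = np.einsum("fi,ij,fj->f", V.conj(), R_inv, V).real
    return 1.0 / quad

def lp_mvdr_ratio(frame, order=12, n_freq=256, smooth=5):
    """Return (raw, smoothed) LP/MVDR ratio spectra on [0, pi].

    The moving-average smoother (odd length `smooth`) is an assumption;
    the abstract only states that the ratio spectrum is smoothed.
    """
    omega = np.linspace(0.0, np.pi, n_freq)
    r = autocorr(frame, order)
    raw = lp_spectrum(r, omega) / mvdr_spectrum(r, omega)
    half = smooth // 2
    padded = np.pad(raw, half, mode="edge")
    smoothed = np.convolve(padded, np.ones(smooth) / smooth, mode="valid")
    return raw, smoothed
```

With this direction of the ratio, the raw value is at least 1 at every frequency, because the reciprocal of the order-M MVDR spectrum equals the sum of the reciprocals of the LP spectra of orders 0 through M. How the smoothed ratio is turned into a whisper/neutral decision is not stated in the abstract and is left out of the sketch.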
