A novel framework for noise robust ASR using cochlear implant-like spectrally reduced speech

We propose a novel framework for noise-robust automatic speech recognition (ASR) based on cochlear implant-like spectrally reduced speech (SRS). Two experimental protocols (EPs) are proposed to clarify the advantage of using SRS for noise-robust ASR. Both EPs assess SRS in the training and testing environments. One of the two EPs uses speech enhancement to improve the quality of the testing speech. In training, SRS is synthesized from the original clean speech, whereas in testing, SRS is synthesized directly from noisy speech or from enhanced speech signals. The synthesized SRS is recognized with ASR systems trained on SRS signals synthesized with the same parameters. Experiments show that the word accuracy obtained with the SRS-based ASR systems is significantly improved compared with the baseline non-SRS ASR systems. We also propose a measure of the training and testing mismatch based on the Kullback-Leibler divergence. The numerical results show that using SRS in ASR systems significantly reduces the training and testing mismatch caused by environmental noise. The HMM-based ASR systems were trained, and the recognition tests performed, using the HTK toolkit and the Aurora 2 speech database.
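The abstract does not specify how the Kullback-Leibler mismatch measure is computed. As a rough illustration only, the sketch below assumes that training and testing feature frames are each summarized by a single diagonal-covariance Gaussian, for which the divergence has a closed form. The function names, the single-Gaussian simplification, and the variance floor are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def fit_diag_gaussian(features):
    """Return the per-dimension mean and variance of a (frames, dims) array."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-8  # small floor (assumed) to avoid division by zero
    return mean, var


def kl_diag_gaussians(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q), in nats, for two diagonal-covariance Gaussians."""
    terms = np.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0
    return 0.5 * float(np.sum(terms))


def mismatch(train_features, test_features):
    """Hypothetical mismatch score: KL divergence from training to testing statistics."""
    mean_p, var_p = fit_diag_gaussian(train_features)
    mean_q, var_q = fit_diag_gaussian(test_features)
    return kl_diag_gaussians(mean_p, var_p, mean_q, var_q)
```

Under these assumptions, a smaller value of `mismatch(train_features, test_features)` would indicate that the testing features are statistically closer to the training features, which is the kind of reduction the abstract attributes to using SRS.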
