A feature extraction method using subband based periodicity and aperiodicity decomposition with noise robust frontend processing for automatic speech recognition

Abstract: This paper proposes a frontend processing technique that employs a speech feature extraction method called Subband based Periodicity and Aperiodicity DEcomposition (SPADE), and examines its validity for automatic speech recognition in noisy environments. SPADE divides a speech signal into subband signals, decomposes each subband signal into periodic and aperiodic features, and uses both as speech feature parameters. It estimates periodicity independently within each subband, and its periodicity–aperiodicity decomposition follows a parallel distributed processing design motivated by the human speech perception process. Unlike other speech features, this decomposition provides information about both periodicities and aperiodicities, and thus exploits the robustness of periodic features without losing essential information carried by aperiodic features. This paper first introduces an implementation of SPADE that operates in the frequency domain, and then examines the validity of combining SPADE with speech enhancement methods. For this examination, we combine SPADE with noise compensation methods that operate in the frequency domain and with cepstral normalization methods. In addition, we employ an energy parameter calculation method based on the SPADE framework. An evaluation with the AURORA-2J noisy continuous digit speech recognition database (Japanese AURORA-2) shows that SPADE combined with adaptive Wiener filtering, cepstral normalization, and the energy parameter achieves average word accuracy rates of 82.58% with clean training and 92.55% with multicondition training. These rates are higher than those achieved with ETSI WI008 advanced DSR frontend processing (77.98% and 91.01%, respectively), whose speech feature parameters are based on conventional Mel-frequency cepstral coefficients.
Compared with the ETSI WI008 advanced DSR frontend, the proposed method reduces word error rates by 20.9% with clean training and 17.2% with multicondition training. These results confirm that SPADE combined with noise reduction methods can increase robustness in the presence of noise.
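
The relative error reductions quoted above can be checked from the reported word accuracies. The following is a minimal sketch, assuming that the word error rate is simply 100% minus the word accuracy; the function name `relative_wer_reduction` is illustrative and does not come from the paper.

```python
def relative_wer_reduction(baseline_acc: float, proposed_acc: float) -> float:
    """Relative word error rate reduction (%) from two word accuracies (%).

    Assumption: word error rate = 100 - word accuracy.
    """
    baseline_wer = 100.0 - baseline_acc
    proposed_wer = 100.0 - proposed_acc
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Word accuracies as reported in the abstract (baseline: ETSI WI008 frontend).
clean = relative_wer_reduction(baseline_acc=77.98, proposed_acc=82.58)
multi = relative_wer_reduction(baseline_acc=91.01, proposed_acc=92.55)

print(f"clean training: {clean:.1f}%")
print(f"multicondition training: {multi:.1f}%")
```

With the two-decimal accuracies given here, the clean-training figure reproduces the stated 20.9%. The multicondition figure comes out at about 17.1%, which is consistent with the stated 17.2% once the rounding of the published accuracies is taken into account.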
