Temporal Feature Selection for Noisy Speech Recognition

Automatic speech recognition systems rely on feature extraction techniques to improve their performance. Static features obtained from each frame are usually augmented with dynamic components computed by derivative operations (delta features). However, derivatives are sensitive to noise, which degrades recognition accuracy in noisy environments. We propose an alternative to delta features that selects coefficients from adjacent frames based on frequency. We observed that consecutive samples are highly correlated at low frequencies, so more representative dynamics can be captured by looking farther away in time. We evaluated this frequency-based selection strategy on the Aurora 2 continuous-digits and connected-digits tasks using standard MFCC, PLPCC and LPCC features. Our experiments show an average relative improvement of \(32.10\%\) in accuracy, with most of the gains in very noisy environments, where traditional delta features yield low recognition rates.
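As a rough illustration of the contrast described above, the following sketch computes conventional regression-based delta features and, separately, augments each frame with coefficients copied from frames at a per-coefficient distance. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the function names, the edge handling (repeating the first and last frame), and the example offsets are all hypothetical, and the abstract does not specify how the offsets are actually chosen.

```python
from typing import List, Sequence

Frames = List[List[float]]  # frames[t][k]: coefficient k of frame t


def _clamp(t: int, n_frames: int) -> int:
    """Repeat the first/last frame at the edges (an assumed convention)."""
    return max(0, min(n_frames - 1, t))


def delta_features(frames: Frames, window: int = 2) -> Frames:
    """Regression-based deltas:
    d_t = sum_i i * (c_{t+i} - c_{t-i}) / (2 * sum_i i^2), i = 1..window."""
    n = len(frames)
    denom = 2 * sum(i * i for i in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for k in range(len(frames[t])):
            num = sum(
                i * (frames[_clamp(t + i, n)][k] - frames[_clamp(t - i, n)][k])
                for i in range(1, window + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


def select_adjacent(frames: Frames, offsets: Sequence[int]) -> Frames:
    """Hypothetical selection: for coefficient k, append its value from the
    frames at t - offsets[k] and t + offsets[k] instead of a derivative."""
    n = len(frames)
    out = []
    for t in range(n):
        past = [frames[_clamp(t - offsets[k], n)][k] for k in range(len(offsets))]
        future = [frames[_clamp(t + offsets[k], n)][k] for k in range(len(offsets))]
        out.append(list(frames[t]) + past + future)
    return out


# Toy input: 6 frames, 2 coefficients with constant slopes 1 and 2.
toy = [[float(t), 2.0 * t] for t in range(6)]
deltas = delta_features(toy, window=2)
# Assumed offsets: look 3 frames away for coefficient 0, 1 frame for coefficient 1.
augmented = select_adjacent(toy, offsets=[3, 1])
```

In this sketch a larger offset would be assigned to coefficients whose consecutive values are highly correlated, following the observation in the abstract; how that correlation maps to a concrete offset is left open here.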
