Robust speech recognition using spatial-temporal feature distribution characteristics

Histogram equalization (HEQ) is one of the most efficient and effective techniques for reducing the mismatch between training and test acoustic conditions. However, most current HEQ methods operate in a purely dimension-wise manner and do not account for the contextual relationships between consecutive speech frames. In this paper, we present several novel HEQ approaches that exploit spatial-temporal feature distribution characteristics for speech feature normalization. Automatic speech recognition (ASR) experiments were carried out on the Aurora-2 standard noise-robust ASR task, and the performance of the presented approaches was thoroughly tested and verified through comparisons with other popular HEQ methods. The experimental results show that, for clean-condition training, our approaches yield a significant word error rate reduction over the baseline system and perform competitively relative to the other HEQ methods compared in this paper.
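To make the baseline concrete, the following is a minimal sketch of the conventional dimension-wise HEQ that the abstract contrasts against: each feature dimension of an utterance is mapped through its empirical CDF onto a reference distribution (a standard normal is assumed here as the reference). The function name and the choice of a Gaussian reference are illustrative assumptions, not the paper's method; note that, as the abstract points out, this baseline ignores the contextual relationships between consecutive frames.

```python
# Hypothetical sketch of conventional dimension-wise HEQ (not the
# paper's spatial-temporal method): each dimension is quantile-mapped
# to a standard-normal reference, one frame at a time, with no use of
# inter-frame context.
import numpy as np
from statistics import NormalDist  # stdlib inverse CDF of the reference

def heq_dimension_wise(features: np.ndarray) -> np.ndarray:
    """features: (num_frames, num_dims) array, e.g. MFCC vectors."""
    ref = NormalDist()  # assumed reference distribution: N(0, 1)
    T, _ = features.shape
    out = np.empty_like(features, dtype=float)
    for d in range(features.shape[1]):
        # rank each value within its dimension -> empirical CDF in (0, 1)
        ranks = features[:, d].argsort().argsort()
        cdf = (ranks + 0.5) / T
        # map through the inverse reference CDF (quantile matching)
        out[:, d] = [ref.inv_cdf(p) for p in cdf]
    return out
```

Because the mapping is a monotone function of each value's rank, the ordering of feature values within a dimension is preserved while the marginal distribution is forced toward the reference.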
