Exploring Joint Equalization of Spatial-Temporal Contextual Statistics of Speech Features for Robust Speech Recognition

Histogram equalization (HEQ) of speech features has recently attracted considerable research interest in robust speech recognition, owing to its simple formulation and remarkable performance. The work in this paper continues this line of research in two significant respects. First, we propose a novel framework for joint equalization of the spatial-temporal contextual statistics of speech features: simple differencing and averaging operations are used to capture the contextual relationships among feature vector components, both across dimensions and across consecutive speech frames, for feature normalization. Second, we exploit a polynomial-fitting scheme to efficiently approximate the inverse cumulative distribution function of the training speech, working in conjunction with the proposed normalization framework; this offers lower storage and computation costs than conventional HEQ methods. All experiments were conducted on the Aurora-2 database and task. The methods derived from the proposed framework were thoroughly evaluated against other popular robustness methods, and the results suggest their utility.
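To make the two ideas concrete, the following is a minimal sketch, not the authors' implementation: all function names, the neighbour span, and the polynomial degree are our own illustrative assumptions. It augments each frame with temporal differencing/averaging streams, and approximates the inverse CDF of training data by fitting a polynomial to its order statistics, so that equalization at test time needs only the polynomial coefficients rather than stored histograms.

```python
import numpy as np

def temporal_context(features, span=1):
    """Augment each frame with differencing and averaging over its neighbours.

    features: (T, D) array of feature vectors.
    Returns (T, 3*D): static, difference, and average streams.
    (Illustrative sketch; the paper's exact operations may differ.)
    """
    padded = np.pad(features, ((span, span), (0, 0)), mode="edge")
    diff = padded[2 * span:] - padded[:-2 * span]         # x[t+span] - x[t-span]
    avg = 0.5 * (padded[2 * span:] + padded[:-2 * span])  # mean of the two neighbours
    return np.concatenate([features, diff, avg], axis=1)

def fit_inverse_cdf_poly(train_column, degree=7):
    """Fit a polynomial u -> F_train^{-1}(u) from training order statistics."""
    y = np.sort(train_column)
    u = (np.arange(len(y)) + 0.5) / len(y)  # empirical quantile levels
    return np.polyfit(u, y, degree)

def equalize_column(test_column, coeffs):
    """Map each test value through its empirical CDF, then the fitted inverse."""
    ranks = test_column.argsort().argsort()  # 0..T-1 rank of each value
    u = (ranks + 0.5) / len(test_column)
    return np.polyval(coeffs, u)
```

In this sketch each feature dimension is equalized independently; the per-dimension histograms of conventional HEQ are replaced by `degree + 1` polynomial coefficients, which is where the storage and lookup savings would come from.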
