BIDIRECTIONAL NEURAL NETWORK FOR FEATURE COMPENSATION OF CLEAN AND TELEPHONE SPEECH SIGNALS

In this paper, we continue our previous work on nonlinear feature compensation of distortions in clean and telephone speech recognition systems. We have previously shown that a Bidirectional Neural Network (Bidi-NN) can compensate for nonlinearly distorted components of feature vectors. In this study, we aim to improve recognition accuracy on clean and telephone speech data by employing a two-stage feature compensation technique that recovers Log-Filter Bank Energies (LFBE) that are optimal from a classification point of view. These features are obtained by training a new Bidi-NN that takes the previously compensated features as its input data. We also derive MFCC features by applying the discrete cosine transform (DCT) to the compensated LFBE features. HMM phone models are trained on these modified features. Using the two-stage compensated features, we obtained absolute improvements of 4.73% and 9.29% in phone recognition accuracy over the baseline system in clean and telephone conditions, respectively. We also obtained an absolute improvement of 25.67% in phone recognition accuracy for the system trained on clean data but tested on telephone data. These results show the effectiveness of NN-based nonlinear compensation of speech feature vectors in HMM-based speech recognition systems.
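The LFBE-to-MFCC step mentioned in the abstract can be illustrated with a minimal sketch. It assumes an unnormalised DCT-II, 18 filter-bank channels, and 12 retained cepstral coefficients; none of these settings is stated in the abstract, and the function name `lfbe_to_mfcc` and the random input are hypothetical stand-ins for compensated features.

```python
import numpy as np

def lfbe_to_mfcc(lfbe, num_ceps=12):
    """Map (frames, bands) log filter-bank energies to (frames, num_ceps) cepstra.

    Uses an unnormalised DCT-II along the band axis (an assumption; the
    abstract does not specify the DCT variant or the number of coefficients).
    """
    lfbe = np.asarray(lfbe, dtype=float)
    num_bands = lfbe.shape[1]
    n = np.arange(num_bands)              # filter-bank channel index
    k = np.arange(num_ceps)[:, None]      # cepstral coefficient index
    # DCT-II basis: cos(pi * k * (n + 0.5) / num_bands), shape (num_ceps, num_bands)
    basis = np.cos(np.pi * k * (n + 0.5) / num_bands)
    return lfbe @ basis.T

# Hypothetical input: 5 frames of 18 compensated LFBE values.
rng = np.random.default_rng(0)
compensated_lfbe = rng.normal(size=(5, 18))
mfcc = lfbe_to_mfcc(compensated_lfbe, num_ceps=12)
print(mfcc.shape)  # (5, 12)
```

With this unnormalised basis, the zeroth coefficient is simply the sum of the log energies in a frame, and a spectrally flat frame yields zero for all higher coefficients, which is why the DCT output is a compact description of spectral shape.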