Multi-source neural networks based on fixed and multiple resolution analysis for speech recognition

This paper reports the results obtained by an automatic speech recognition system that uses mel-frequency cepstral coefficients (MFCCs), J-RASTA perceptual linear prediction (J-RASTA PLP) coefficients, and energies from a multi-resolution analysis (MRA) tree of filters as input features. The recognizer is a hybrid system in which a neural network (NN) provides the observation probabilities for a network of hidden Markov models. The paper also compares the performance of the system for various combinations of these features. Combining J-RASTA PLP coefficients with the energies computed at the outputs of the leaves of an MRA filter tree reduces the word error rate (WER) by 20% with respect to J-RASTA PLP coefficients alone. Such a combination is practically feasible because the NN architecture is designed to integrate multiple features, exploiting the ability of NNs to mix several input parameters without any assumption about their stochastic independence. Recognition is performed on a very large test set in which many speakers utter proper names from different locations of the Italian public telephone network.
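As a rough illustration of the feature combination described above, the following minimal sketch computes the energy at each leaf of a binary filter tree and appends the log energies to a second feature stream to form a single NN input vector. Everything specific here is an assumption made for illustration and is not taken from the paper: the Haar filter pair, the full (uniform) tree, the log compression, and the function names `haar_split`, `mra_leaf_energies` and `combine_features` are all hypothetical.

```python
import math


def haar_split(signal):
    """Split a signal into low-pass and high-pass halves, downsampled by 2.

    Assumption: orthonormal Haar filters stand in for whatever filters
    the actual system uses. A trailing odd sample is dropped.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    low = [s * (signal[i] + signal[i + 1]) for i in pairs]
    high = [s * (signal[i] - signal[i + 1]) for i in pairs]
    return low, high


def mra_leaf_energies(frame, depth):
    """Return the energy at each leaf of a full binary filter tree.

    Leaves are ordered by split path (low branch first), which is not
    the same as ordering by increasing frequency.
    """
    nodes = [list(frame)]
    for _ in range(depth):
        next_nodes = []
        for node in nodes:
            low, high = haar_split(node)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return [sum(x * x for x in node) for node in nodes]


def combine_features(other_features, energies, floor=1e-10):
    """Concatenate one feature stream with log leaf energies.

    The result is a single input vector; no independence between the
    two streams is assumed at this stage.
    """
    return list(other_features) + [math.log(e + floor) for e in energies]
```

For an 8-sample frame and `depth=2`, `mra_leaf_energies` returns four leaf energies whose sum equals the energy of the frame, because the Haar pair is orthonormal. A real front end would operate on windowed speech frames and would likely use a deeper, possibly non-uniform tree.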
