An Experiment with Feed-Forward Neural Network for Speech Recognition

This article addresses continuous speech recognition of Slovak digits using an artificial neural network (ANN) architecture. A feed-forward neural network with one hidden layer is used in the experiments. The input to the network is a context window five frames wide, each frame described by 26 mel-frequency cepstral coefficients (MFCC) with energy and deltas included, giving 5 × 26 = 130 features; the network classifies the central speech frame (the third of the five). The hidden layer has 200 units. The output units provide posterior probabilities of their corresponding phonetic categories, of which we used 238, all context-dependent and phoneme-based. The time matrix of these probabilities is searched with a Viterbi search, constrained by pronunciations and grammar, to obtain the most probable digit-string hypothesis. Our experiments were performed using the CSLU (Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology) speech toolkit [1].
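The input construction and network dimensions described above can be illustrated with a short NumPy sketch. It is not the CSLU toolkit implementation. The sigmoid hidden activation, the softmax output, the edge padding at utterance boundaries, and the random weights standing in for trained ones are all assumptions made for illustration; only the sizes (26 coefficients, 5 frames, 200 hidden units, 238 categories) come from the text.

```python
import numpy as np

N_COEFFS = 26    # per-frame features: MFCC with energy and deltas (from the text)
CONTEXT = 5      # frames in the context window (from the text)
N_HIDDEN = 200   # hidden units (from the text)
N_CLASSES = 238  # context-dependent phonetic categories (from the text)


def stack_context(frames, context=CONTEXT):
    """Stack each frame with its neighbours into one input vector.

    Row t of the result holds frames t-2 .. t+2 side by side, so the
    central (third) block is frame t itself. Utterance edges are padded
    by repeating the first/last frame (an assumption, not from the text).
    """
    half = context // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    n_frames = frames.shape[0]
    return np.hstack([padded[i:i + n_frames] for i in range(context)])


def forward(x, w1, b1, w2, b2):
    """One-hidden-layer feed-forward pass returning per-frame posteriors."""
    hidden = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))  # sigmoid (assumed)
    logits = hidden @ w2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)  # softmax (assumed)


# Hypothetical data: 50 random frames stand in for real acoustic features,
# and random weights stand in for a trained network.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, N_COEFFS))
inputs = stack_context(frames)

w1 = rng.standard_normal((CONTEXT * N_COEFFS, N_HIDDEN)) * 0.1
b1 = np.zeros(N_HIDDEN)
w2 = rng.standard_normal((N_HIDDEN, N_CLASSES)) * 0.1
b2 = np.zeros(N_CLASSES)

posteriors = forward(inputs, w1, b1, w2, b2)
```

Here `inputs` has shape (50, 130) and `posteriors` has shape (50, 238), with each row summing to 1. This frames-by-categories array corresponds to the time matrix of probabilities that the Viterbi search would then process.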

[1] H. Bourlard et al. Links Between Markov Models and Multilayer Perceptrons, 1990, IEEE Trans. Pattern Anal. Mach. Intell.

[2] H. Bourlard et al. Continuous Speech Recognition, 1995, IEEE Signal Process. Mag.