A diphone-based digit recognition system using neural networks

In exploring new ways of looking at speech data, we have developed an alternative method of segmentation for training a neural-network-based digit-recognition system. Whereas previous methods segment the data into monophones, biphones, or triphones and train on each sub-phone unit in several broad-category contexts, our method uses modified diphones to train on the regions of greatest spectral change as well as the regions of greatest stability. Although we model regions of spectral stability, we do not require their presence in our word models. On a test set of the OGI Numbers corpus, this method achieved a 13% reduction in word-level error compared with a baseline system that used context-independent monophones and context-dependent biphones and triphones.
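To make the segmentation idea concrete, the following is a minimal sketch of how a word's phone sequence might be expanded into transition units (regions of spectral change) interleaved with optional steady-state units (regions of stability). It is an illustration under stated assumptions, not the system described above: the function name `diphone_units`, the `sil` boundary label, the `a>b` and `<a>` naming scheme, and the phone labels for "six" are all hypothetical.

```python
from typing import List, Tuple

# (unit name, is_optional) -- a hypothetical representation of one model unit
Unit = Tuple[str, bool]


def diphone_units(phones: List[str], boundary: str = "sil") -> List[Unit]:
    """Expand a phone sequence into diphone-style units.

    Each adjacent phone pair yields a required transition unit "a>b".
    Each phone also yields a steady-state unit "<a>", marked optional so
    that a word model may skip it.
    """
    padded = [boundary] + phones + [boundary]
    units: List[Unit] = []
    for left, right in zip(padded, padded[1:]):
        # Transition between two phones: always part of the word model.
        units.append((f"{left}>{right}", False))
        # Steady-state centre of the right-hand phone: optional.
        if right != boundary:
            units.append((f"<{right}>", True))
    return units


def required_path(units: List[Unit]) -> List[str]:
    """Return the shortest path through the model: transitions only."""
    return [name for name, optional in units if not optional]


# Illustrative phone labels for the digit "six" (an assumption, not the
# paper's symbol set).
six = diphone_units(["s", "ih", "k", "s"])
print(six)
print(required_path(six))
```

Marking the steady-state units as optional reflects the statement that regions of spectral stability are accounted for but not required in the word models; how the actual system encodes this in its word models is not specified here.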
