Context dependent hybrid HMM/ANN systems for large vocabulary continuous speech recognition system

In this paper, hybrid HMM/ANN systems are used to model context dependent phones. To reduce the number of parameters and to better capture the dynamics of the phonetic segments, we combine context dependent diphone models with context independent phone models. Transitions from phone to phone are modeled as generalized context dependent distributions, while the phonetic units are context independent models trained on the less coarticulated middle part of each phone. Words are thus modeled as a sequence of probability distributions alternately representing the middle part of a phone and the transition to the next phone. A single neural network is used to estimate both the context independent phone probabilities and the generalized context dependent diphone (phone to phone transition) probabilities. The resulting systems are compared to a classical hybrid HMM/ANN system with the same number of parameters. The Phonebook isolated word database and the Resource Management and Wall Street Journal continuous speech databases have been used for training and testing the new methods.
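As an illustration of the word topology described above, the following minimal sketch expands a phone sequence into alternating phone-middle and transition units. The function name `word_state_sequence`, the `a>b` label format, and the example phone symbols are hypothetical and not taken from the paper. The sketch also uses exact diphone labels for the transitions, whereas the paper uses generalized context dependent classes, and it assumes that a word ends on the middle unit of its last phone.

```python
from typing import List


def word_state_sequence(phones: List[str]) -> List[str]:
    """Expand a phone sequence into alternating units.

    Each phone contributes one context independent unit (its middle
    part), followed by one transition unit to the next phone, if any.
    The "a>b" label format is a hypothetical convention for this sketch.
    """
    states: List[str] = []
    for i, phone in enumerate(phones):
        # Context independent unit for the middle part of the phone.
        states.append(phone)
        # Transition unit to the following phone (assumption: none is
        # emitted after the last phone of the word).
        if i + 1 < len(phones):
            states.append(f"{phone}>{phones[i + 1]}")
    return states


if __name__ == "__main__":
    print(word_state_sequence(["k", "ae", "t"]))
```

For a three-phone word this yields five units: three phone-middle units interleaved with two transition units. In the approach described above, a single network would provide the probabilities for both kinds of unit.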