Hierarchical mixtures of experts methodology applied to continuous speech recognition

In this paper, we incorporate the hierarchical mixtures of experts (HME) method of probability estimation, developed by Jordan (1994), into a hidden Markov model (HMM)-based continuous speech recognition system. The resulting system can be thought of as a continuous-density HMM system, but instead of using Gaussian mixtures, the HME system employs a large set of hierarchically organized but relatively small neural networks to perform the probability density estimation. The hierarchical structure is reminiscent of a decision tree, except for two important differences: each "expert" or neural net performs a "soft" decision rather than a hard decision, and, unlike ordinary decision trees, the parameters of all the neural nets in the HME are automatically trainable using the expectation-maximization (EM) algorithm. We report results on the ARPA Wall Street Journal corpus with 5,000-word and 40,000-word vocabularies using HME models.
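To make the architecture concrete, the sketch below shows a forward pass through a minimal two-level HME in Python/NumPy. It is an illustrative reconstruction, not the authors' implementation: the class name, layer sizes, and the choice of linear-softmax experts and gates are assumptions. The key property it demonstrates is the "soft" decision: gating networks at each level of the tree output probabilities rather than hard branch choices, so the final estimate is a convex combination of every expert's output, weighted by the product of gate probabilities along the path to that expert.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TwoLevelHME:
    """Illustrative two-level hierarchical mixture of experts.

    Each expert is a small linear-softmax network over output classes
    (e.g. phonetic states); gating networks at the root and at each
    branch produce soft mixing probabilities, so every expert contributes
    to the final estimate. All parameter shapes here are assumptions made
    for the sake of a runnable example.
    """
    def __init__(self, in_dim, n_classes, n_branches=2, n_experts=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_gate = rng.normal(0, 0.1, (in_dim, n_branches))
        self.sub_gates = rng.normal(0, 0.1, (n_branches, in_dim, n_experts))
        self.experts = rng.normal(0, 0.1, (n_branches, n_experts, in_dim, n_classes))

    def predict_proba(self, x):
        # x: (in_dim,) acoustic feature vector for one frame.
        g_top = softmax(x @ self.top_gate)                    # root gate: (n_branches,)
        p = np.zeros(self.experts.shape[-1])
        for i in range(self.top_gate.shape[1]):
            g_sub = softmax(x @ self.sub_gates[i])            # branch gate: (n_experts,)
            for j in range(self.sub_gates.shape[2]):
                expert_out = softmax(x @ self.experts[i, j])  # expert's class posteriors
                # Soft decision: weight by the product of gates on the path.
                p += g_top[i] * g_sub[j] * expert_out
        return p  # mixture of expert posteriors; sums to 1

# Example: 39-dimensional cepstral features, 48 output classes (hypothetical sizes).
hme = TwoLevelHME(in_dim=39, n_classes=48)
frame = np.random.default_rng(1).normal(size=39)
print(hme.predict_proba(frame).sum())  # -> 1.0
```

In the full system, the parameters of all gates and experts would be fit jointly with the EM algorithm rather than left at their random initial values as in this sketch.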