Continuous speech recognition by means of acoustic/ Phonetic classification obtained from a hidden Markov model

This paper describes an experimental continuous speech recognition system comprising procedures for acoustic/phonetic classification, lexical access and sentence retrieval. Speech is assumed to be composed of a small number of phonetic units which may be identified with the states of a hidden Markov model. The acoustic correlates of the phonetic units are then characterized by the observable Gaussian process associated with the corresponding state of the underlying Markov chain. Once the parameters of such a model are determined, a phonetic transcription of an utterance can be obtained by means of a Viterbi-like algorithm. Given a lexicon in which each entry is orthographically represented in terms of the chosen phonetic units, a word lattice is produced by a lexical access procedure. Lexical items whose orthography matches subsequences of the phonetic transcription are sought by means of a hash coding technique and their likelihoods are computed directly from the corresponding interval of acoustic measurements. The recognition process is completed by recovering from the word lattice, the string of words of maximum likelihood conditioned on the measurements. The desired string is derived by a best-first search algorithm. In an experimental evaluation of the system, the parameters of an acoustic/phonetic model were estimated from fluent utterances of 37 seven-digit numbers. A digit recognition rate of 96% was then observed on an independent test set of 59 utterances of the same form from the same speaker. Half of the observed errors resulted from insertions while deletions and substitutions accounted equally for the other half.