ISADORA: A Speech Modelling Network Based on Hidden Markov Models

In this paper we present the ISADORA system, which provides highly flexible speech recognition based on HMM technology together with a hierarchical representation of speech units. Markov model topologies, subword unit inventories, regular grammars expressed in finite-state or phrase-structure style, and even the analysis tasks themselves are explicitly represented by the nodes of a large speech unit network. Thus, nothing that can be "said in the language of Markov models" needs to be hard-wired in the program code. In contrast to traditional compiled network recognizers, units, grammars, and tasks may be created or modified at analysis time, and the outcome of the decoding process is a structured symbolic description of the sensory input. Our architecture has proven extremely useful in prototyping new kinds of subword units. Besides generalized triphones and context-freezing units, a new subword speech unit for automatic speech recognition has been implemented. These so-called polyphones are phone-like units which generalize the well-known concept of triphone units in that more than one left or right context symbol is allowed; context items may be of segmental or suprasegmental nature. In addition, a powerful new training paradigm based on the propagation of statistical parameters through the speech unit network has been introduced. The propagation-based Baum-Welch training algorithm is capable of fast and robust estimation of very large parameter sets: the real-time factor for training is very low (0.3) and independent of utterance duration and model complexity. The paper closes with performance figures for numerous continuous speech recognition experiments. With a suitable inventory of polyphones as subword units, a vocabulary of 162 words (or 1081 words), and no grammar, speaker-dependent training yielded a word accuracy of 98.3 % (or 92.4 %, respectively). In the speaker-independent mode, accuracies of 91.8 % (84.5 %) have been achieved. This performance is among the best reported so far for speaker-independent large-vocabulary continuous speech recognition.
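The way polyphones generalize triphones can be illustrated with a minimal sketch. It is not the ISADORA implementation: the `Polyphone` class, the `generalizes` method, and the `polyphones_at` helper are hypothetical names introduced here, and context items are assumed to be plain strings (which could stand for segmental or suprasegmental symbols alike).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Polyphone:
    """A phone with left and right context of arbitrary width (hypothetical layout)."""
    left: Tuple[str, ...]    # context symbols preceding the center, nearest one last
    center: str              # the phone being modelled
    right: Tuple[str, ...]   # context symbols following the center, nearest one first

    def generalizes(self, other: "Polyphone") -> bool:
        """True if this unit's context is the part of other's context nearest the center."""
        if self.center != other.center:
            return False
        if len(self.left) > len(other.left) or len(self.right) > len(other.right):
            return False
        left_ok = other.left[len(other.left) - len(self.left):] == self.left
        right_ok = other.right[:len(self.right)] == self.right
        return left_ok and right_ok


def polyphones_at(phones: List[str], i: int, max_left: int, max_right: int) -> List[Polyphone]:
    """Enumerate all polyphones centred on phones[i], from context-free up to the given widths."""
    units = []
    for l in range(0, min(max_left, i) + 1):
        for r in range(0, min(max_right, len(phones) - 1 - i) + 1):
            units.append(
                Polyphone(tuple(phones[i - l:i]), phones[i], tuple(phones[i + 1:i + 1 + r]))
            )
    return units


# Assumed toy phone sequence, for illustration only.
phones = ["h", "a", "l", "o"]
units = polyphones_at(phones, 1, max_left=2, max_right=2)
monophone = Polyphone((), "a", ())
triphone = Polyphone(("h",), "a", ("l",))
wide = Polyphone(("h",), "a", ("l", "o"))
```

In this sketch a triphone is simply the special case with exactly one symbol on each side, and a unit with shorter context generalizes every unit whose context extends it. How ISADORA actually selects among such candidates is not stated in the abstract.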
