Weighted finite-state transducers in speech recognition

We survey the use of weighted finite-state transducers (WFSTs) in speech recognition. We show that WFSTs provide a common and natural representation for hidden Markov models (HMMs), context-dependency, pronunciation dictionaries, grammars, and alternative recognition outputs. Furthermore, general transducer operations combine these representations flexibly and efficiently. Weighted determinization and minimization algorithms optimize their time and space requirements, and a weight pushing algorithm distributes the weights along the paths of a weighted transducer optimally for speech recognition. As an example, we describe a North American Business News (NAB) recognition system built using these techniques. It combines the HMMs, full cross-word triphones, a lexicon of 40,000 words, and a large trigram grammar into a single weighted transducer that is only somewhat larger than the trigram word grammar and that runs NAB in real time on a very simple decoder. In another example, we show that the same techniques can be used to optimize lattices for second-pass recognition. In a third example, we show how general automata operations can be used to assemble lattices from different recognizers to improve recognition performance.
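To make the idea of combining representations by composition concrete, here is a minimal sketch in Python. It assumes epsilon-free machines over the tropical semiring, where weights are negative log probabilities that add along a path and the minimum is taken across paths. The names `Wfst`, `compose`, and `best_path`, along with the toy word lattice and grammar, are hypothetical illustrations and not the library or data used in the paper. Epsilon handling, determinization, minimization, and weight pushing are deliberately left out.

```python
import heapq
from collections import defaultdict, namedtuple

Arc = namedtuple("Arc", "ilabel olabel weight nextstate")


class Wfst:
    """Toy epsilon-free weighted transducer (hypothetical helper, tropical semiring)."""

    def __init__(self, start, finals, arcs):
        self.start = start
        self.finals = dict(finals)  # state -> final weight
        self.arcs = defaultdict(list)  # state -> outgoing arcs
        for src, ilabel, olabel, weight, dst in arcs:
            self.arcs[src].append(Arc(ilabel, olabel, weight, dst))


def compose(a, b):
    """Pair up states of a and b; match a's output labels with b's input labels."""
    start = (a.start, b.start)
    arcs, finals = [], {}
    stack, seen = [start], {start}
    while stack:
        q = stack.pop()
        qa, qb = q
        if qa in a.finals and qb in b.finals:
            finals[q] = a.finals[qa] + b.finals[qb]
        for x in a.arcs.get(qa, ()):
            for y in b.arcs.get(qb, ()):
                if x.olabel == y.ilabel:
                    nxt = (x.nextstate, y.nextstate)
                    # Tropical "times" is ordinary addition of costs.
                    arcs.append((q, x.ilabel, y.olabel, x.weight + y.weight, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return Wfst(start, finals, arcs)


def best_path(t):
    """Dijkstra search (assumes non-negative weights); returns (cost, output labels)."""
    counter = 0  # tie-breaker so the heap never compares states
    heap = [(0.0, counter, t.start, ())]
    done = set()
    while heap:
        cost, _, q, out = heapq.heappop(heap)
        if q is None:  # sentinel: a final weight has been paid
            return cost, out
        if q in done:
            continue
        done.add(q)
        if q in t.finals:
            counter += 1
            heapq.heappush(heap, (cost + t.finals[q], counter, None, out))
        for arc in t.arcs.get(q, ()):
            counter += 1
            heapq.heappush(
                heap, (cost + arc.weight, counter, arc.nextstate, out + (arc.olabel,))
            )
    return None


# Toy first-pass lattice with made-up acoustic costs.
lattice = Wfst(0, {2: 0.0}, [
    (0, "recognize", "recognize", 2.0, 1),
    (0, "wreck", "wreck", 1.0, 1),
    (1, "speech", "speech", 1.0, 2),
    (1, "beach", "beach", 0.5, 2),
])

# Toy one-state unigram grammar with made-up language-model costs.
grammar = Wfst(0, {0: 0.0}, [
    (0, "recognize", "recognize", 1.0, 0),
    (0, "wreck", "wreck", 3.0, 0),
    (0, "speech", "speech", 1.0, 0),
    (0, "beach", "beach", 2.5, 0),
])

rescored = compose(lattice, grammar)
print(best_path(lattice))
print(best_path(rescored))
```

In this toy setup the lattice alone prefers the acoustically cheaper word sequence, while the composed machine adds the grammar costs to every path and so can prefer a different one. A full system of the kind described above would compose the HMM, context-dependency, lexicon, and grammar transducers in the same spirit and then optimize the result.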
