The Kaldi Speech Recognition Toolkit

We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMMs) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.
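
As a rough illustration of the WFST-based approach the abstract refers to, the sketch below composes two toy transducers with OpenFst, the library Kaldi builds on. The symbols, weights, and output file name are invented for the example; this is not code from Kaldi itself, only a minimal sketch of the kind of transducer composition such a system performs when assembling a decoding graph.

    // Minimal OpenFst sketch (assumed example, not from the Kaldi sources):
    // compose a toy "lexicon" transducer with a toy "grammar" acceptor.
    #include <fst/fstlib.h>

    int main() {
      using fst::StdVectorFst;
      using fst::StdArc;

      // Toy lexicon-style transducer L: maps input symbol 1 to output symbol 10.
      StdVectorFst L;
      L.AddState();                                   // state 0
      L.AddState();                                   // state 1
      L.SetStart(0);
      L.SetFinal(1, StdArc::Weight::One());
      L.AddArc(0, StdArc(1, 10, StdArc::Weight::One(), 1));

      // Toy grammar acceptor G over the output symbols of L.
      StdVectorFst G;
      G.AddState();
      G.AddState();
      G.SetStart(0);
      G.SetFinal(1, StdArc::Weight::One());
      G.AddArc(0, StdArc(10, 10, StdArc::Weight::One(), 1));

      // Sort arcs so the default composition matcher can look them up.
      fst::ArcSort(&L, fst::OLabelCompare<StdArc>());

      // Compose L with G; in a full system, compositions like this build
      // up the recognition network used for decoding.
      StdVectorFst LG;
      fst::Compose(L, G, &LG);
      LG.Write("LG.fst");                             // write the result to disk
      return 0;
    }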
