Combining tandem and hybrid systems for improved speech recognition and keyword spotting on low resource languages

In recent years there has been significant interest in Automatic Speech Recognition (ASR) and Keyword Spotting (KWS) systems for low-resource languages, driven in large part by the IARPA Babel project. This paper examines the performance gains that can be obtained by combining two forms of deep neural network ASR system, Tandem and Hybrid, for both ASR and KWS using data released under the Babel project. Baseline systems are described for the five Option Period 1 languages: Assamese, Bengali, Haitian Creole, Lao, and Zulu. All the ASR systems share common attributes, for example deep neural network configurations and decision trees based on rich phonetic questions and state-position root nodes. The baseline ASR and KWS performance of the Hybrid and Tandem systems is compared for both the "full" (approximately 80 hours of training data) and "limited" (approximately 10 hours) language packs. Combining the two systems yields consistent KWS performance gains in all configurations.
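The abstract does not spell out how the Tandem and Hybrid outputs are merged; in the Babel KWS literature this is commonly done at the score level, by combining the keyword hit lists (posting lists) produced by the individual systems after per-keyword score normalisation. The sketch below is a minimal illustration of that general idea only, not the paper's actual method: the function name, hit format, equal weights, and time-overlap tolerance are illustrative assumptions.

```python
from collections import defaultdict

def combine_kws_hits(hits_a, hits_b, weight_a=0.5, weight_b=0.5, tol=0.5):
    """Merge keyword hit lists from two KWS systems by weighted score summation.

    Each hit is a tuple (keyword, start_time, duration, score); scores are
    assumed to be already normalised per keyword (e.g. sum-to-one), as is
    common before system combination.  Hits for the same keyword whose start
    times fall within `tol` seconds of each other are merged; unmatched hits
    are kept with their single-system (down-weighted) score.
    """
    by_kw = defaultdict(list)
    for kw, start, dur, score in hits_b:
        # Last element flags whether this system-B hit has been merged.
        by_kw[kw].append([start, dur, score, False])

    combined = []
    for kw, start, dur, score in hits_a:
        merged = False
        for other in by_kw[kw]:
            if not other[3] and abs(other[0] - start) <= tol:
                # Overlapping detection from both systems: weighted score sum.
                combined.append((kw, start, dur,
                                 weight_a * score + weight_b * other[2]))
                other[3] = True
                merged = True
                break
        if not merged:
            combined.append((kw, start, dur, weight_a * score))

    # Keep system-B hits that found no overlapping system-A hit.
    for kw, entries in by_kw.items():
        for start, dur, score, used in entries:
            if not used:
                combined.append((kw, start, dur, weight_b * score))

    return sorted(combined, key=lambda h: (h[0], h[1]))
```

In practice the combination weights and the normalisation scheme (for example keyword-specific sum-to-one normalisation) would be tuned on a development set rather than fixed as above.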
