The IBM 2008 GALE Arabic speech transcription system

This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 3.5 machine translation evaluation. Key advances compared to our Phase 2.5 system include improved discriminative training, the use of Subspace Gaussian Mixture Models (SGMM), neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models and neural network language models. These advances were instrumental in achieving a word error rate of 8.9% on the evaluation test set.

[1] Brian Kingsbury, et al. The IBM 2008 GALE Arabic speech transcription system, 2010, ICASSP.
