A comparative study of fMPE and RDLT approaches to LVCSR

This paper presents a comparative study of two discriminatively trained feature transform approaches to large vocabulary continuous speech recognition (LVCSR): feature-space minimum phone error (fMPE) and region-dependent linear transform (RDLT). Experiments are performed on an LVCSR task of conversational telephone speech transcription using about 2,000 hours of training data. Starting from a maximum likelihood (ML) trained GMM-HMM-based baseline system, we evaluate the recognition accuracy and run-time efficiency of different variants of the two methods, and we identify a specific RDLT approach that we recommend for deployment in LVCSR applications.
