Warped Minimum Variance Distortionless Response based bottle neck features for LVCSR

This paper presents the results of our experiments on bottleneck features (BNF) applied to a warped Minimum Variance Distortionless Response (wMVDR) frontend. We examine how best to optimize wMVDR-BNF features and bottleneck features computed from combined wMVDR and MFCC input (wMVDR+MFCC-BNF). On the Quaero 2010 German evaluation set, our wMVDR+MFCC-BNF frontend reduces the word error rate (WER) of a single-pass system to 18.1%, compared with 18.7% for an MFCC-BNF system and 20.7% for an MFCC system. When used in a system combination, our wMVDR-BNF and wMVDR+MFCC-BNF systems reduced the overall WER from 14.3% to 13.3% on the IWSLT 2010 test set while reducing the number of systems needed from 9 to 5. Our WER of 11.9% on the IWSLT 2012 test set is lower than the best result submitted during the evaluation campaign.
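To put the absolute WER figures above in perspective, the following minimal sketch converts them into relative reductions. The helper name `relative_wer_reduction` and the dictionary labels are illustrative assumptions, not part of the paper; only the WER values are taken from the abstract.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent (hypothetical helper)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER pairs (baseline, improved) in percent, as reported in the abstract.
# The labels are our own shorthand for the comparisons.
comparisons = {
    "MFCC-BNF -> wMVDR+MFCC-BNF (Quaero 2010 German)": (18.7, 18.1),
    "MFCC -> wMVDR+MFCC-BNF (Quaero 2010 German)": (20.7, 18.1),
    "System combination (IWSLT 2010)": (14.3, 13.3),
}

for label, (baseline, improved) in comparisons.items():
    reduction = relative_wer_reduction(baseline, improved)
    print(f"{label}: {baseline:.1f}% -> {improved:.1f}% "
          f"({reduction:.1f}% relative reduction)")
```

Under these figures, the gain over the MFCC-BNF system is roughly 3% relative, the gain over the plain MFCC system roughly 13% relative, and the system combination gain roughly 7% relative.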
