I N T E R N a T I O N a L C O M P U T E R S C I E N C E an Investigation of Tandem Mlp Features for Asr

This project explores speech feature representations produced by discriminatively trained multi-layer perceptrons. Previous research has demonstrated that such a tandem approach can be successfully exploited for large-vocabulary automatic speech recognition systems. The principal aim of this work is to empirically evaluate some variants of these features. While experimental results validate some of the design choices of the standard implementation, other evidence suggests alternatives that may improve performance. From this exploratory investigation, we hypothesize which of the various modifications are most promising; applied to a Mandarin broadcast news task, the new configuration demonstrates significant improvement. Along with the novel presentation of a " best-case scenario " and other cheating experiments, an interpretation of these results is discussed with the hope of guiding future directions of research. Abstract ii 1 Introduction

[1]  Mark J. F. Gales,et al.  Progress in the CU-HTK broadcast news transcription system , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Andreas Stolcke,et al.  Improved discriminative training using phone lattices , 2005, INTERSPEECH.

[3]  Florian Metze,et al.  A flexible stream architecture for ASR using articulatory features , 2002, INTERSPEECH.

[4]  Jan Cernocký,et al.  Probabilistic and Bottle-Neck Features for LVCSR of Meetings , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[5]  Daniel P. W. Ellis,et al.  Feature extraction using non-linear transformation for robust speech recognition on the Aurora database , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[6]  Hervé Bourlard,et al.  New entropy based combination rules in HMM/ANN multi-stream ASR , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[7]  Mei-Yuh Hwang,et al.  Incorporating tone-related MLP posteriors in the feature representation for Mandarin ASR , 2005, INTERSPEECH.

[8]  Frantisek Grézl,et al.  Improved MLP structures for data-driven feature extraction for ASR , 2005, INTERSPEECH.

[9]  Daniel P. W. Ellis,et al.  Investigations into tandem acoustic modeling for the Aurora task , 2001, INTERSPEECH.

[10]  Daniel P. W. Ellis,et al.  Size matters: an empirical study of neural network training for large vocabulary continuous speech recognition , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[11]  Stephen Cox,et al.  Some statistical issues in the comparison of speech recognition algorithms , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[12]  Daniel P. W. Ellis,et al.  Error visualization for tandem acoustic modeling on the Aurora task , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13]  Daniel Povey,et al.  Large scale MMIE training for conversational telephone speech recognition , 2000 .

[14]  Nelson Morgan,et al.  Tonotopic multi-layered perceptron: a neural network for learning long-term temporal features for speech recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[15]  Wen Wang,et al.  Investigation on Mandarin broadcast news speech recognition , 2006, INTERSPEECH.

[16]  Jonathan G. Fiscus,et al.  Tools for the analysis of benchmark speech recognition tests , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[17]  Mei-Yuh Hwang,et al.  Improved tone modeling for Mandarin broadcast news speech recognition , 2006, INTERSPEECH.

[18]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  Hynek Hermansky,et al.  Hierarchical tandem feature extraction , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[20]  Andreas Stolcke,et al.  Combining Discriminative Feature, Transform, and Model Training for Large Vocabulary Speech Recognition , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[21]  Hynek Hermansky,et al.  Generalized tandem feature extraction , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[22]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[23]  Fabio Valente,et al.  Combination of Acoustic Classifiers Based on Dempster-Shafer Theory of Evidence , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[24]  Jonathan G. Fiscus,et al.  1998 Broadcast News Benchmark Test Results: English and Non-English Word Error Rate Performance Measures , 1998 .

[25]  Daniel P. W. Ellis,et al.  Tandem acoustic modeling in large-vocabulary recognition , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[26]  Pavel Matejka,et al.  Hierarchical Structures of Neural Networks for Phoneme Recognition , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[27]  S. Wegmann,et al.  Speaker normalization on conversational telephone speech , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[28]  Shuangyu Chang,et al.  Learning discriminative temporal patterns in speech: development of novel TRAPS-like classifiers , 2003, INTERSPEECH.

[29]  Daniel P. W. Ellis,et al.  Connectionist speech recognition of Broadcast News , 2002, Speech Commun..

[30]  Geoffrey Zweig,et al.  fMPE: discriminatively trained features for speech recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[31]  S. J. Young,et al.  Tree-based state tying for high accuracy acoustic modelling , 1994 .

[32]  Jean-Marc Boite,et al.  Nonlinear discriminant analysis for improved speech recognition , 1997, EUROSPEECH.

[33]  Spyridon Matsoukas,et al.  Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[34]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[35]  Samy Bengio,et al.  Towards using hierarchical posteriors for flexible automatic speech recognition systems , 2004 .

[36]  Andreas Stolcke,et al.  Using MLP features in SRI's conversational speech recognition system , 2005, INTERSPEECH.

[37]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[38]  Pavel Matejka,et al.  Towards Lower Error Rates in Phoneme Recognition , 2004, TSD.

[39]  Hynek Hermansky,et al.  Temporal patterns (TRAPs) in ASR of noisy speech , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[40]  Andreas Stolcke,et al.  An efficient repair procedure for quick transcriptions , 2004, INTERSPEECH.

[41]  Simon King,et al.  Articulatory Feature-Based Methods for Acoustic and Audio-Visual Speech Recognition: Summary from the 2006 JHU Summer workshop , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.