Reducing errors by increasing the error rate: MLP Acoustic Modeling for Broadcast News Transcription

We describe some aspects of a Broadcast News recognition system based on hybrid HMM/MLP acoustic modeling. These include the use of novel ‘modulation spectrogram’ features, which are combined with conventional models at the posterior probability level; some experiments with nonlinear segment normalization; and an investigation of the interaction of model size and training set size for a multilayer perceptron (MLP) acoustic classifier. We also report preliminary results of incorporating gender-dependence into this system.

1. Background

In recent years, we and our colleagues have promoted the exploration of novel, poorly understood, but promising approaches to speech recognition [2]. While such deviations from incremental improvement might initially hurt performance, the subset of new methods that ultimately prove useful would not be found without such explorations. This past year, we attempted to follow this advice while still developing a system with reasonable performance on the automatic transcription of Broadcast News speech. An additional goal was to find approaches that would work well in combination with components developed by our SPRACH partners at Cambridge and Sheffield. Finally, previously published results seemed to indicate that, while the hybrid HMM/connectionist approach was successful for moderate-sized training corpora, it did not appear to take advantage of significant increases in corpus size. Recently improved computational capabilities at ICSI permitted tests to determine whether this was true. Given these considerations, we developed experimental Broadcast News systems that incorporated:

[1] Brian Kingsbury, et al. Spert-II: A Vector Microprocessor System, 1996, Computer.

[2] Philip C. Woodland, et al. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, 1995, Comput. Speech Lang.

[3] Ciro Martins, et al. Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system, 1995, EUROSPEECH.

[4] Brian Kingsbury, et al. An Overview of the SPRACH System for the Transcription of Broadcast News, 1999.

[5] Hynek Hermansky, et al. Towards increasing speech recognition error rates, 1995, Speech Commun.

[6] Mei-Yuh Hwang, et al. Improved Hidden Markov Modeling for Speaker-Independent Continuous Speech Recognition, 1990, HLT.

[7] Yochai Konig, et al. Connectionist gender adaptation in a hybrid neural network / hidden Markov model speech recognition system, 1992, ICSLP.

[8] Hynek Hermansky, et al. RASTA processing of speech, 1994, IEEE Trans. Speech Audio Process.

[9] Nelson Morgan, et al. Perceptually inspired signal processing strategies for robust speech recognition in reverberant environments, 1998.

[10] Steven Greenberg, et al. Performance improvements through combining phone- and syllable-scale information in automatic speech recognition, 1998, ICSLP.

[11] Thomas Hain, et al. The 1997 HTK broadcast news transcription system, 1998.

[12] H. Hermansky, et al. Perceptual linear predictive (PLP) analysis of speech, 1990, The Journal of the Acoustical Society of America.