In this paper we seek to streamline various operations within the front end of a speech recognizer, both to reduce unnecessary computation and to simplify the conceptual framework. First, a novel view of the front end in terms of linear transformations is presented. Then we study the invariance of recognition performance with respect to linear transformations (LTs) at the front end. Analysis reveals that several LT steps can be consolidated into a single LT, which effectively eliminates the Discrete Cosine Transform (DCT) step of the traditional MFCC (Mel-Frequency Cepstral Coefficient) front end. Moreover, a highly simplified, data-driven front-end scheme is proposed as a direct generalization of this idea. The new setup has no Mel-scale filtering, another part of the MFCC front end. Experimental results show a 5% relative improvement on the Broadcast News task.

1. LINEAR TRANSFORMATIONS IN THE TRADITIONAL FRONT END

The front end is a relatively independent component of a speech recognition system. Although the acoustic model parameters depend directly on the front-end parameterization, researchers tend to view the front end as a black box. When testing several different front ends, the acoustic model structure is seldom altered: one simply plugs in another front end, re-estimates the model parameters, and keeps whichever variant yields the lowest WER (Word Error Rate). It is important to realize, however, that front-end design and acoustic modeling are closely coupled. Below we go through a typical front end found in most LVCSR systems, with an emphasis on the connections between the two components; a small numeric sketch of the consolidation argument follows the list.

1. First, the Fourier spectrum is warped to compensate for gender/speaker differences (Vocal Tract Length Normalization, or VTLN).

2. The warped spectrum is then smoothed by integrating over triangular bins arranged along a non-linear Mel frequency scale (Mel-scale filtering).

[Front-end block diagram omitted; surviving stage labels: DCT, LDA, CMN. Footnote: here VTLN and ... steps are not shown for simplicity.]
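To make the consolidation argument concrete, the sketch below (NumPy; the 24-channel filter bank, 13 cepstra, and the random matrix standing in for a later transform such as LDA are illustrative assumptions, not values from our system) checks that applying the DCT and a subsequent linear transform one after the other produces exactly the same features as applying their single matrix product, which is why the separate DCT step can be dropped once a trainable LT follows it.

import numpy as np

n_mel, n_ceps = 24, 13                      # assumed sizes, for illustration only

# DCT-II basis: row k is cos(pi/n_mel * (n + 0.5) * k), the transform used for MFCCs.
n = np.arange(n_mel)
k = np.arange(n_ceps)[:, None]
C = np.cos(np.pi / n_mel * (n[None, :] + 0.5) * k)    # shape (n_ceps, n_mel)

rng = np.random.default_rng(0)
A = rng.standard_normal((n_ceps, n_ceps))   # stand-in for a later LT (e.g. LDA/MLLT)
x = rng.standard_normal(n_mel)              # a dummy log Mel filter-bank frame

two_steps = A @ (C @ x)                     # DCT, then the later transform
one_step = (A @ C) @ x                      # both steps consolidated into a single LT

assert np.allclose(two_steps, one_step)     # identical features: the DCT step is redundant

The same associativity argument applies to any chain of per-frame linear operations, which is what allows a single consolidated, data-driven transform to replace the fixed DCT.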