Frequency-domain linear prediction for temporal features

Current speech recognition systems uniformly employ short-time spectral analysis, usually over windows of 10-30 ms, as the basis for their acoustic representations. Any detail below this timescale is lost, and even temporal structure above it is only weakly captured, typically through delta features and the like. We address this limitation with a novel representation of the temporal envelope in different frequency bands, obtained by exploiting the dual of conventional linear prediction (LPC) applied in the transform domain. In this technique of frequency-domain linear prediction (FDLP), the 'poles' of the model describe temporal, rather than spectral, peaks. With analysis windows on the order of hundreds of milliseconds, the procedure automatically determines how to distribute poles so as to best model the temporal structure within the window. While this approach opens up many possibilities for novel speech features, we experiment with one particular form, an index describing the 'sharpness' of individual poles within a window, and show a relatively large word error rate improvement, from 4.97% to 3.81%, in a recognizer trained on general conversational telephone speech and tested on a small-vocabulary spontaneous numbers task. We analyze this improvement in terms of confusion matrices and suggest how the newly modeled fine temporal structure may be helping.
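To make the duality concrete, the following is a minimal sketch, not the authors' implementation, assuming Python with NumPy/SciPy; the function names (fdlp_envelope, lpc_autocorrelation, pole_sharpness) and all parameter values are invented for illustration. Conventional autocorrelation-method LPC is fitted to the DCT of a long analysis window, the resulting all-pole 'spectrum' is read out along a time axis as a smooth estimate of the temporal envelope, and pole radii serve as a rough stand-in for a per-pole sharpness index.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz


def lpc_autocorrelation(y, order):
    """Autocorrelation-method LPC: returns coefficients a (with a[0] == 1)
    and the prediction-error power, used here as the model gain."""
    r = np.correlate(y, y, mode='full')[len(y) - 1:len(y) + order]
    a_rest = solve_toeplitz((r[:-1], r[:-1]), -r[1:])  # Yule-Walker equations
    a = np.concatenate(([1.0], a_rest))
    gain = r[0] + np.dot(a_rest, r[1:])
    return a, gain


def fdlp_envelope(x, order=20, n_points=None):
    """Fit an all-pole model to the DCT of a long window; its 'spectrum'
    is read out as a smooth estimate of the temporal envelope of x."""
    y = dct(x, type=2, norm='ortho')        # transform-domain 'signal'
    a, gain = lpc_autocorrelation(y, order)
    n_points = n_points or len(x)
    # Evaluate gain / |A(e^{jw})|^2 on a grid; in FDLP this axis indexes
    # time within the analysis window rather than frequency.
    A = np.fft.rfft(a, 2 * n_points)[:n_points]
    return gain / (np.abs(A) ** 2 + 1e-12)


def pole_sharpness(a):
    """Crude stand-in for a per-pole 'sharpness' index: the pole radius
    (closer to 1 implies a sharper temporal peak)."""
    roots = np.roots(a)
    return np.abs(roots[np.imag(roots) >= 0])  # one of each conjugate pair


if __name__ == '__main__':
    fs = 8000
    t = np.arange(int(0.25 * fs)) / fs                    # 250 ms window
    x = np.random.randn(len(t)) * np.exp(-((t - 0.1) / 0.01) ** 2)  # noise burst
    env = fdlp_envelope(x, order=20)
    print(env.argmax())            # envelope peaks near the burst (~sample 800)
    print(pole_sharpness(lpc_autocorrelation(dct(x, type=2, norm='ortho'), 20)[0]))
```

In this sketch the pole order and window length are arbitrary; in practice the DCT coefficients would first be split into frequency sub-bands so that each band receives its own temporal-envelope model.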
