The performance of the Mel-Frequency Cepstrum Coefficients (MFCC) may be affected by (1) the number of filters, (2) the shape of the filters, (3) the way in which the filters are spaced, and (4) the way in which the power spectrum is warped. In this paper, several comparison experiments are carried out to find the best implementation. The traditional MFCC calculation excludes the 0th coefficient because it is regarded as somewhat unreliable. Based on analysis and experiments, the authors find that it can be regarded as the generalized frequency band energy (FBE) and is hence useful, which results in the FBE-MFCC. The authors also propose a better analysis of the frame energy, namely auto-regressive analysis, which outperforms its 1st and/or 2nd order differential derivatives. Experiments with the “863” Speech Database show that, compared with the traditional MFCC with its corresponding auto-regressive analysis coefficients, the FBE-MFCC and the frame energy with their corresponding auto-regressive analysis coefficients form the best combination, reducing the Chinese syllable error rate (CSER) by about 10%, while the FBE-MFCC with its corresponding auto-regressive analysis coefficients reduces the CSER by 2.5%. Comparison experiments are also done on a quite casual Chinese speech database, the Chinese Annotated Spontaneous Speech (CASS) corpus, where the FBE-MFCC reduces the error rate by about 2.9% on average.
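To make the four design factors and the role of the 0th coefficient concrete, the following is a minimal sketch, not the authors' implementation. It assumes a common textbook setup: the mel warping formula 2595·log10(1 + f/700), triangular filters whose centers are equally spaced on the mel scale, and an unnormalized DCT-II over the log filter-bank energies. The function names (`hz_to_mel`, `triangular_filterbank`, `mfcc_with_c0`), the filter count of 24, and the synthetic power spectrum are all illustrative assumptions. Under these assumptions the 0th coefficient equals the sum of the log filter-bank energies, which is one way to see why it can be read as a band-energy term; the paper's exact definition of the generalized FBE may differ.

```python
import math


def hz_to_mel(f_hz):
    # Assumed warping: a widely used mel-scale formula, not necessarily the paper's.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def triangular_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters with edges equally spaced on the mel scale.

    num_bins is the number of power-spectrum bins covering 0 .. sample_rate / 2.
    Returns a list of num_filters weight lists, each of length num_bins.
    """
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # num_filters + 2 edge frequencies: each filter spans three consecutive edges.
    edges_hz = [mel_to_hz(max_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [nyquist * k / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        weights = []
        for f in bin_hz:
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc_with_c0(power_spectrum, bank, num_ceps):
    """Return (coefficients c0..c_num_ceps, log filter-bank energies)."""
    floor = 1e-10  # avoids log(0) for empty or silent bands
    log_energies = [
        math.log(max(sum(w * p for w, p in zip(weights, power_spectrum)), floor))
        for weights in bank
    ]
    num_bands = len(log_energies)
    # Unnormalized DCT-II; n = 0 is kept instead of being discarded.
    ceps = [
        sum(e * math.cos(math.pi * n * (m + 0.5) / num_bands)
            for m, e in enumerate(log_energies))
        for n in range(num_ceps + 1)
    ]
    return ceps, log_energies


# Illustrative use on a synthetic, strictly positive power spectrum.
spectrum = [1.0 + 0.5 * math.cos(0.1 * k) for k in range(129)]
bank = triangular_filterbank(num_filters=24, num_bins=129, sample_rate=16000)
ceps, log_e = mfcc_with_c0(spectrum, bank, num_ceps=12)
```

Changing `num_filters`, the slope expressions in `triangular_filterbank`, the edge spacing, or `hz_to_mel` corresponds to varying factors (1) through (4) respectively, which is the kind of comparison the abstract describes.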