Evaluation of formant-like features on an automatic vowel classification task.

Numerous attempts have been made to find low-dimensional, formant-related representations of speech signals that are suitable for automatic speech recognition. However, it is often not known how these features behave in comparison with true formants. The purpose of this study was to compare two sets of automatically extracted formant-like features, namely robust formants and HMM2 features, to hand-labeled formants. The robust formant features were derived by means of the split Levinson algorithm, while the HMM2 features correspond to the frequency segmentation of speech signals obtained by two-dimensional hidden Markov models. Mel-frequency cepstral coefficients (MFCCs) were also included in the investigation as an example of state-of-the-art automatic speech recognition features. The feature sets were compared in terms of their performance on a vowel classification task. The speech data and hand-labeled formants used in this study are a subset of the American English vowels database presented in Hillenbrand et al. [J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data and in noisy acoustic conditions. On clean data, the classification performance of the formant-like features compared very well to that of the hand-labeled formants in a gender-dependent experiment, but was inferior to that of the hand-labeled formants in a gender-independent experiment. The results obtained in noisy acoustic conditions indicated that the formant-like features used in this study are not inherently noise robust. For clean and noisy data, as well as for the gender-dependent and gender-independent experiments, the MFCCs achieved results equal or superior to those of the formant features, but at the price of a much higher feature dimensionality.
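
For concreteness, the short Python sketch below (using the librosa library) illustrates how formant-like values and MFCCs might be obtained from a single vowel token. It is only an illustrative approximation, not the pipeline evaluated in the study: the robust formants were derived with the split Levinson algorithm and the HMM2 features from two-dimensional hidden Markov models, whereas this sketch falls back on ordinary LPC root-finding and a standard MFCC routine. The file name, LPC order, and bandwidth threshold are assumptions made purely for the example.

import numpy as np
import librosa

def lpc_formants(y, sr, order=None, max_formants=3):
    # Rough formant estimates from the roots of an LPC polynomial
    # (plain autocorrelation LPC, not the split Levinson algorithm).
    if order is None:
        order = int(2 + sr / 1000)                   # common rule of thumb
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])       # pre-emphasis
    a = librosa.lpc(y * np.hamming(len(y)), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = np.angle(roots) * sr / (2 * np.pi)       # pole angles -> Hz
    bws = -np.log(np.abs(roots)) * sr / np.pi        # pole radii -> bandwidths (Hz)
    cands = sorted(f for f, b in zip(freqs, bws) if f > 90.0 and b < 400.0)
    return cands[:max_formants]                      # crude F1..F3 estimates

y, sr = librosa.load("vowel_token.wav", sr=None)     # hypothetical vowel segment
print("LPC formant estimates (Hz):", lpc_formants(y, sr))

# 13 mel-frequency cepstral coefficients over the same token, analogous to
# the MFCC baseline in the study (computed here with librosa for illustration).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print("MFCC matrix shape:", mfcc.shape)

In a classification experiment of the kind described above, such per-frame features would be pooled over the vowel segment (or sampled near its steady state) and passed to a classifier, with the formant-based vectors being far lower-dimensional than the MFCC vectors.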

[1] J. Hillenbrand et al., Vowel classification based on fundamental frequency and formant frequencies, Journal of Speech and Hearing Research, 1987.

[2] Katrin Weber et al., HMM Mixtures (HMM2) for Robust Speech Recognition, 2003.

[3] Kuldip K. Paliwal et al., Automatic Speech and Speaker Recognition: Advanced Topics, 1999.

[4] John E. Markel et al., Linear Prediction of Speech, Communication and Cybernetics, 1976.

[5] Li Deng et al., An expectation maximization approach for formant tracking using a parameter-free non-linear predictor, Proc. IEEE ICASSP, 2003.

[6] R. Plomp et al., Perceptual and physical space of vowel sounds, The Journal of the Acoustical Society of America, 1969.

[7] L. F. Willems, Robust formant analysis, 1986.

[8] Hervé Bourlard et al., Speech recognition using advanced HMM2 features, Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2001.

[9] Ronald W. Schafer et al., Digital Processing of Speech Signals, 1978.

[10] Tjeerd Andringa et al., Continuity preserving signal processing, 2002.

[11] J. Flanagan, Speech Analysis, Synthesis and Perception, 1971.

[12] Lou Boves et al., Comparing acoustic features for robust ASR in fixed and cellular network applications, Proc. IEEE ICASSP, 2000.

[13] J. Hillenbrand et al., Acoustic characteristics of American English vowels, The Journal of the Acoustical Society of America, 1995.

[14] Samy Bengio et al., HMM2 - Extraction of Formant Features and their Use for Robust ASR, 2001.

[15] Hermann Ney et al., A model for efficient formant estimation, Proc. IEEE ICASSP, 1996.

[16] Dick Howard, Esprit, Telos, 1978.

[17] Lou Boves et al., Acoustic backing-off as an implementation of missing feature theory, Speech Communication, 2001.

[18] Philippe Delsarte et al., The split Levinson algorithm, IEEE Transactions on Acoustics, Speech, and Signal Processing, 1986.

[19] Stan Davis et al., Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, 1980.

[20] Philip N. Garner et al., Using formant frequencies in speech recognition, Proc. EUROSPEECH, 1997.

[21] Steve Young et al., The HTK Book, 1995.

[22] Samy Bengio et al., A Pragmatic View of the Application of HMM2 for ASR, 2001.

[23] Samy Bengio et al., HMM2 - extraction of formant structures and their use for robust ASR, Proc. INTERSPEECH, 2001.

[24] Samy Bengio et al., Evaluation of formant-like features for ASR, Proc. INTERSPEECH, 2002.

[25] Samy Bengio et al., HMM2 - a novel approach to HMM emission probability estimation, Proc. INTERSPEECH, 2000.

[26] G. E. Peterson et al., Control Methods Used in a Study of the Vowels, 1951.

[27] Philip N. Garner et al., On the robust incorporation of formant features into hidden Markov models for automatic speech recognition, Proc. IEEE ICASSP, 1998.

[28] Andrzej Drygajlo et al., Statistical estimation of unreliable features for robust speech recognition, Proc. IEEE ICASSP, 2000.

[29] Fred D. Minifie, Normal aspects of speech, hearing, and language, 1973.

[30] Phil D. Green et al., Robust automatic speech recognition with missing and unreliable acoustic data, Speech Communication, 2001.