ABSTRACT

Our proposed paradigm for automatic assessment of pronunciation quality uses hidden Markov models (HMMs) to generate phonetic segmentations of the student’s speech. From these segmentations, we use the HMMs to obtain spectral match and duration scores. In this work we focus on the problem of calibrating different machine scores to obtain an accurate prediction of the grades that a human expert would assign to the pronunciation. We discuss the application of different approaches based on minimum mean square error (MMSE) estimation and Bayesian classification. We investigate the characteristics of the different mappings as well as the effects of the prior distribution of grades in the calibration database. We finally suggest a simple method to extrapolate mappings from one language to another.

1. INTRODUCTION

This work is part of an effort aimed at developing computer-based systems for language instruction; we address the task of grading the pronunciation quality of the speech of a student of a foreign language. The automatic grading system uses an HMM-based continuous speech recognition system [1] to generate phonetic segmentations. Based on these segmentations and probabilistic models, we produce different pronunciation scores for individual sentences or groups of sentences that can be used as predictors of the pronunciation quality. Different types of these machine scores can be combined to obtain a better estimate of the overall pronunciation quality. In this work we discuss the application of several methods to obtain and calibrate the mapping from the machine scores to the pronunciation quality grades that a human expert would have given. If we treat these human grades and machine scores as random variables, the pronunciation evaluation problem can be considered an estimation problem in which we try to estimate, or predict, the value of the human grade by using a set of predictors.
These predictors are the machine scores that we have presented in our previous work [2], [3], [5].

We investigate the use of MMSE estimation and classification methods to predict the human grade from a set of machine scores. We present alternative implementations of these methods based on nonparametric techniques. We illustrate their application using a pronunciation-quality-graded database of nonnative Spanish. We also investigate the effect of the grade priors on the mappings. Finally, we suggest a simple method to extrapolate calibrated mappings from one language to another.
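To make the two kinds of mapping concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a single machine score, a histogram (binning) estimator as the nonparametric technique, and a small invented toy data set; all function names, bin edges, and values are hypothetical. The MMSE mapping is the conditional mean of the human grade given the score, approximated here by the mean grade in each score bin. The Bayesian classifier picks the grade that maximizes the product of the histogram likelihood P(bin | grade) and the grade prior, so changing the prior can change the predicted grade.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def bin_index(score, edges):
    """Index of the histogram bin containing `score` (edges sorted ascending)."""
    return bisect_right(edges, score)


def fit_mmse_map(scores, grades, edges):
    """Nonparametric MMSE mapping: mean human grade within each score bin."""
    sums, counts = defaultdict(float), Counter()
    for s, g in zip(scores, grades):
        b = bin_index(s, edges)
        sums[b] += g
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in counts}


def fit_likelihoods(scores, grades, edges):
    """Histogram estimate of P(bin | grade) for every grade in the data."""
    per_grade = defaultdict(Counter)
    for s, g in zip(scores, grades):
        per_grade[g][bin_index(s, edges)] += 1
    return {g: {b: n / sum(c.values()) for b, n in c.items()}
            for g, c in per_grade.items()}


def bayes_grade(score, likelihoods, priors, edges):
    """Grade maximizing P(bin | grade) * P(grade) for the bin of `score`."""
    b = bin_index(score, edges)
    return max(priors, key=lambda g: likelihoods.get(g, {}).get(b, 0.0) * priors[g])


# Invented toy calibration data: one machine score and one human grade per sentence.
scores = [0.1, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 0.95]
grades = [1, 1, 2, 2, 3, 2, 3, 3]
edges = [0.4, 0.7]  # hypothetical bin boundaries

mmse_map = fit_mmse_map(scores, grades, edges)
likelihoods = fit_likelihoods(scores, grades, edges)

uniform = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
skewed = {1: 0.1, 2: 0.8, 3: 0.1}  # a prior that heavily favors grade 2

print(mmse_map[bin_index(0.5, edges)])               # mean grade in the middle bin
print(bayes_grade(0.9, likelihoods, uniform, edges))
print(bayes_grade(0.9, likelihoods, skewed, edges))
```

In this toy example the same score of 0.9 is assigned grade 3 under the uniform prior and grade 2 under the skewed prior, which is the kind of dependence on the grade distribution of the calibration database that the text refers to.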

REFERENCES

[1] Yoon Kim et al., "Automatic pronunciation scoring for language instruction," 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997.
[2] Leo Breiman et al., Classification and Regression Trees, 1984.
[3] Vassilios Digalakis et al., "Combination of machine scores for automatic grading of pronunciation quality," Speech Commun., 2000.
[4] Vassilios Digalakis et al., "Genones: optimizing the degree of mixture tying in a large vocabulary hidden Markov model based speech recognizer," Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing, 1994.
[5] Richard Lippmann et al., "Neural Network Classifiers Estimate Bayesian a posteriori Probabilities," Neural Computation, 1991.
[6] Mitch Weintraub et al., "Automatic text-independent pronunciation scoring of foreign language student speech," Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96, 1996.