Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT)

In this study, we propose a discriminative training algorithm that jointly minimizes mispronunciation detection errors (i.e., false rejections and false acceptances) and diagnosis errors (i.e., correctly detecting a mispronunciation but misidentifying how it deviates from the canonical pronunciation). An optimization procedure similar to Minimum Word Error (MWE) discriminative training is developed to refine the maximum-likelihood (ML) trained HMMs. The errors to be minimized are obtained by comparing transcribed training utterances (including annotated mispronunciations) with Extended Recognition Networks (ERNs) [3], which contain both canonical pronunciations and explicitly modeled mispronunciations. The ERN is compiled from either hand-crafted or data-driven rules. Several conclusions can be drawn from the experiments: (1) data-driven rules are more effective than hand-crafted ones in capturing mispronunciations; (2) compared with the ML training baseline, discriminative training reduces false rejections and diagnostic errors, although false acceptances increase slightly because false-acceptance samples are scarce in the training set.
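The error taxonomy above (false rejections, false acceptances, correct diagnoses, and diagnostic errors) can be made concrete with a small scoring sketch. This is an illustrative Python snippet under simplifying assumptions, not the paper's implementation: it assumes the canonical, learner-annotated, and recognized phone sequences are already aligned phone by phone, and the function name `score_detection` is hypothetical.

```python
# Minimal sketch of the mispronunciation detection/diagnosis error taxonomy.
# Assumes phone-level alignment between the canonical sequence, the annotated
# (actually spoken) sequence, and the recognizer's output; names are
# illustrative, not from the paper.

def score_detection(canonical, annotated, recognized):
    """Tally per-phone detection and diagnosis outcomes."""
    counts = {"correct_accept": 0, "false_reject": 0,
              "false_accept": 0, "correct_diag": 0, "diag_error": 0}
    for canon, actual, recog in zip(canonical, annotated, recognized):
        mispronounced = actual != canon   # learner deviated from canonical
        detected = recog != canon         # recognizer flagged a deviation
        if not mispronounced and not detected:
            counts["correct_accept"] += 1
        elif not mispronounced and detected:
            counts["false_reject"] += 1   # correct phone wrongly rejected
        elif mispronounced and not detected:
            counts["false_accept"] += 1   # mispronunciation missed
        elif recog == actual:
            counts["correct_diag"] += 1   # detected and diagnosed correctly
        else:
            counts["diag_error"] += 1     # detected, but wrong diagnosis
    return counts
```

In this framing, the proposed objective jointly penalizes the `false_reject`, `false_accept`, and `diag_error` tallies rather than optimizing likelihood alone.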