Modeling Cantonese pronunciation variation by acoustic model refinement

Pronunciation variations can be roughly classified into two types: phone changes and sound changes [1][2]. A phone change happens when a canonical phone is produced as a different phone. Such a change can be modeled by converting the baseform (standard) phone to a surfaceform (actual) phone. A sound change happens at a lower, phonetic or subphonetic level within a phone, and it cannot be modeled well by either the baseform or the surfaceform phone alone. We propose to refine the acoustic models to cope with sound changes by (1) sharing the Gaussian mixture components of HMM states in the baseform and the surfaceform models; (2) adapting the mixture components of the baseform models towards those of the surfaceform models; and (3) selectively reconstructing new acoustic models through sharing or adapting. The proposed pronunciation modeling algorithms are generic and can, in principle, be applied to different languages. They were tested on a Cantonese speech recognition database, where the three approaches achieved relative word error rate reductions of 5.45%, 2.53%, and 3.04%, respectively.
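
As a rough illustration of approaches (1) and (2), the sketch below pools or shifts the mixture components of one baseform HMM state using one surfaceform state. It is a minimal sketch under stated assumptions, not the procedure evaluated in this work: the one-dimensional Gaussians, the interpolation weights `lam` and `alpha`, the nearest-mean pairing rule, and the names `Gaussian`, `share_components`, and `adapt_towards` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    """One mixture component of an HMM state (1-D for simplicity)."""
    weight: float
    mean: float
    var: float


def share_components(base: List[Gaussian], surface: List[Gaussian],
                     lam: float = 0.3) -> List[Gaussian]:
    """Pool baseform and surfaceform components into one mixture.

    Assumption: weights are rescaled by (1 - lam) and lam so that the
    pooled weights still sum to 1 when each input mixture does.
    """
    shared = [Gaussian((1 - lam) * g.weight, g.mean, g.var) for g in base]
    shared += [Gaussian(lam * g.weight, g.mean, g.var) for g in surface]
    return shared


def adapt_towards(base: List[Gaussian], surface: List[Gaussian],
                  alpha: float = 0.5) -> List[Gaussian]:
    """Move each baseform mean towards its nearest surfaceform mean.

    Assumption: linear interpolation with factor alpha; weights and
    variances of the baseform components are left unchanged.
    """
    adapted = []
    for g in base:
        nearest = min(surface, key=lambda s: abs(s.mean - g.mean))
        new_mean = (1 - alpha) * g.mean + alpha * nearest.mean
        adapted.append(Gaussian(g.weight, new_mean, g.var))
    return adapted


# Toy state models (values are made up for the example).
base = [Gaussian(0.6, 0.0, 1.0), Gaussian(0.4, 2.0, 1.0)]
surface = [Gaussian(1.0, 1.0, 0.5)]

shared = share_components(base, surface, lam=0.25)
adapted = adapt_towards(base, surface, alpha=0.5)
```

In this toy setup, sharing enlarges the baseform state's mixture with the surfaceform component, while adapting keeps the mixture size fixed and only shifts the existing means. Approach (3) would then choose between the two per model, using a selection criterion that is not specified here.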