Automatic Phonetic Segmentation Using HMM Models

Hidden Markov models (HMMs), as used in automatic speech recognition systems, are widely applied in forced-alignment mode to segment text-to-speech (TTS) units. To improve segmentation performance, the choice of acoustic features and the training conditions of the HMMs are examined. Experimental results show that the static 12-dimensional Mel-frequency cepstral coefficient (MFCC) feature is the optimal acoustic feature; the optimal number of Gaussian mixture components per state is 1; and the optimal number of tied states after model clustering with a classification and regression tree (CART) is about 3,000 for speaker-dependent triphone HMMs. With the optimized parameters, segmentation accuracy on an English test corpus increases from 77.3% to 85.4%.
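
To illustrate what segmentation in forced-alignment mode means, the sketch below runs a Viterbi search over a known unit sequence and reads the unit boundaries off the best path. It is a simplified illustration rather than the system evaluated here: each unit is collapsed to a single state, transition probabilities are ignored, and the function name forced_align and the loglik table (per-frame log-likelihoods, which a trained HMM would supply) are assumptions introduced for this example.

```python
import math


def forced_align(loglik):
    """Align frames to a known left-to-right sequence of units.

    loglik[t][s] is the log-likelihood of frame t under the s-th unit
    of the transcription (hypothetical input; a real system would
    compute it from the trained acoustic models).
    Returns the start frame of each unit on the best monotonic path.
    """
    num_frames, num_units = len(loglik), len(loglik[0])
    if num_frames < num_units:
        raise ValueError("need at least one frame per unit")

    neg_inf = -math.inf
    score = [[neg_inf] * num_units for _ in range(num_frames)]
    # advanced[t][s] is True when the best path enters unit s at frame t.
    advanced = [[False] * num_units for _ in range(num_frames)]
    score[0][0] = loglik[0][0]  # the path must start in the first unit

    for t in range(1, num_frames):
        for s in range(num_units):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s] = move + loglik[t][s]
                advanced[t][s] = True
            else:
                score[t][s] = stay + loglik[t][s]

    # Backtrace from the last unit at the last frame.
    starts = [0] * num_units
    s = num_units - 1
    for t in range(num_frames - 1, 0, -1):
        if advanced[t][s]:
            starts[s] = t
            s -= 1
    return starts


# Toy input: 6 frames, 3 units; frames 0-1 fit unit 0, 2-4 unit 1, 5 unit 2.
toy = [
    [-1, -5, -5],
    [-1, -5, -5],
    [-5, -1, -5],
    [-5, -1, -5],
    [-5, -1, -5],
    [-5, -5, -1],
]
print(forced_align(toy))  # [0, 2, 5]
```

The returned start frames are the hypothesized segment boundaries. Segmentation accuracy is then obtained by comparing such boundaries against manually labeled ones.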