Broad phonetic class segmentation study for Thai automatic speech recognition

Automatic broad class segmentation is an important pre-processing step in speech recognition and other speech applications, for example, speech transcription, where it supports the phonetic transcription of speech corpora, and language learning applications, where it supports pronunciation error detection at phone boundaries. This research aims to improve the acoustic parameters used in Thai automatic speech recognition. We propose acoustic parameters that capture the characteristics of the broad manner classes of Thai speech: 1) the spectral center of gravity and the short-time zero-crossing rate, used to classify the silence and continuant features; and 2) the energy ratio of E[0-400] to E[400-6000] (band limits in Hz), used to classify the syllabic feature. Compared with the acoustic parameters used for English, the proposed parameters reduced errors by 28.09%, 11.0% and 2.41% for the continuant, syllabic and silence features, respectively. The speech segmentation task reached an accuracy of 80.46%, a 23.46% error reduction compared with the baseline HMM-MFCC based broad class segmentation. In word recognition tasks, word classification in the CVC context performed similarly to the baseline HMM-MFCC system.
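
The three acoustic parameters named above can be illustrated with a short frame-level sketch. This is a minimal example and not the authors' implementation: the function names, the Hann window, the 25 ms frame with 10 ms hop, and the 16 kHz sampling rate are all assumptions made for illustration, and only the band limits (0-400 Hz and 400-6000 Hz) come from the abstract.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (assumes len(x) >= frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def power_spectrum(frame, sr):
    """Hann-windowed power spectrum and the frequency (Hz) of each bin."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return spec, freqs

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def spectral_center_of_gravity(frame, sr):
    """Power-weighted mean frequency of the frame, in Hz."""
    spec, freqs = power_spectrum(frame, sr)
    total = spec.sum()
    return float((freqs * spec).sum() / total) if total > 0 else 0.0

def band_energy_ratio(frame, sr, split_hz=400.0, top_hz=6000.0, eps=1e-12):
    """Energy below split_hz divided by energy in [split_hz, top_hz]."""
    spec, freqs = power_spectrum(frame, sr)
    low = spec[freqs < split_hz].sum()
    high = spec[(freqs >= split_hz) & (freqs <= top_hz)].sum()
    return float(low / (high + eps))  # eps avoids division by zero on silence

if __name__ == "__main__":
    sr = 16000                               # assumed sampling rate
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 200 * t)       # synthetic low-frequency tone
    frames = frame_signal(tone, frame_len=400, hop=160)  # 25 ms / 10 ms
    f = frames[0]
    print(zero_crossing_rate(f),
          spectral_center_of_gravity(f, sr),
          band_energy_ratio(f, sr))
```

In a sketch like this, each frame yields a small feature vector that a per-feature classifier could consume. The classifier itself is left out because the abstract does not describe how it is configured.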