Phone boundary detection using selective refinements and context-dependent acoustic features

Accurate placement of phone boundaries improves both the performance of speech recognition systems and the quality of concatenative speech synthesis. This study proposes a post-processing technique that refines the phone boundary locations produced by HMM-based forced alignment. Context-dependent Linear Discriminant Analysis (LDA) classifiers, combined with a confidence scoring scheme, are used to locate phone boundaries more precisely. Not every acoustic feature is suitable for locating the boundary between every pair of phonetic segment types, so feature selection is performed separately for each boundary type. The proposed context-dependent refinement yields a 43.9% error reduction in locating phone boundaries relative to the boundaries obtained from HMM-based forced alignment. With a frame size of 10 milliseconds, the average deviation from manually labeled boundaries is reduced from 1.4 to 1.0 frames (from 14 ms to 10 ms).
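The selective part of the refinement can be illustrated with a minimal sketch. It assumes that some boundary classifier (for instance, a context-dependent LDA model) has already assigned a confidence score to each candidate frame near an HMM boundary. The function names, the 0.8 threshold, and the toy scores are hypothetical and are not taken from the study; the sketch only shows the idea of moving a boundary when the classifier is confident and otherwise keeping the forced-alignment result.

```python
from typing import Dict, List


def refine_boundary(hmm_frame: int,
                    candidate_scores: Dict[int, float],
                    threshold: float = 0.8) -> int:
    """Move the boundary to the best-scoring candidate frame only if its
    confidence reaches the threshold; otherwise keep the HMM boundary.

    candidate_scores maps a candidate frame index to a confidence in [0, 1].
    The threshold value is an illustrative assumption.
    """
    if not candidate_scores:
        return hmm_frame
    best_frame = max(candidate_scores, key=candidate_scores.get)
    if candidate_scores[best_frame] >= threshold:
        return best_frame
    return hmm_frame


def mean_abs_deviation(predicted: List[int], reference: List[int]) -> float:
    """Average absolute deviation, in frames, from reference boundaries."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("boundary lists must be non-empty and equal in length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Toy data (invented for illustration): HMM boundaries, manual labels,
# and per-boundary classifier confidences for nearby candidate frames.
hmm_boundaries = [10, 25, 40]
manual_boundaries = [11, 25, 38]
scores = [
    {9: 0.05, 10: 0.10, 11: 0.85},   # confident: boundary is moved
    {24: 0.40, 25: 0.35, 26: 0.25},  # not confident: HMM boundary is kept
    {38: 0.90, 39: 0.06, 40: 0.04},  # confident: boundary is moved
]

refined = [refine_boundary(b, s) for b, s in zip(hmm_boundaries, scores)]
before = mean_abs_deviation(hmm_boundaries, manual_boundaries)
after = mean_abs_deviation(refined, manual_boundaries)
```

Multiplying a deviation in frames by the frame size (10 ms in the study) gives the deviation in milliseconds.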
