An Approach to Proper Speech Segmentation for Quality Improvement in Concatenative Text-To-Speech System for Indian Languages

Most Indian-language Text-To-Speech (TTS) synthesis systems designed to date are based on the concatenation of acoustic units. The prime challenge is selecting proper units and concatenating them smoothly. Because of the limitations of current automated segmentation techniques based on the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW), manual verification and labeling are often essential. This paper proposes automatic placement of phoneme boundaries in a speech waveform using an explicit statistical model of the phoneme boundary. We apply the Harmonic plus Noise Model (HNM) in the first step and then refine the boundary placement by searching, in a region near the estimated boundary, for the best match with a predefined boundary model from a technique such as ESNOLA. The technique is applied to achieve effective concatenation, which results in smooth output. Studies show that HNM is capable of synthesizing all vowels and diphones with good quality, which can remarkably reduce the size of the database. Further, pitch-synchronous analysis is carried out and the Glottal Closure Instants (GCIs) are calculated accurately. The quality of the synthesized speech improves if these units are obtained from the glottal signal rather than by processing the speech signal. A database of vowel-consonant-vowel (VCV) units has to be developed for all Indian languages, as we have done for Oriya, one of the official languages of the Republic of India, in our case study.
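The boundary refinement step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `refine_boundary`, the frame-level feature matrix, the fixed-size boundary `template`, and the Euclidean distance measure are all assumptions made for illustration. The sketch only shows the general idea of scanning a region around an initial boundary estimate and keeping the candidate whose local context best matches a predefined boundary model.

```python
import numpy as np


def refine_boundary(features, initial_idx, template, search_radius):
    """Search near an initial boundary estimate for the best-matching frame.

    features      : (n_frames, n_dims) array of per-frame features (assumed input)
    initial_idx   : frame index of the first-pass boundary estimate
    template      : (width, n_dims) array, a hypothetical boundary model giving
                    the expected feature pattern around a boundary
    search_radius : number of frames to search on each side of initial_idx

    Returns the refined frame index and its distance to the template.
    """
    features = np.asarray(features, dtype=float)
    template = np.asarray(template, dtype=float)
    width = template.shape[0]
    half = width // 2

    best_idx, best_dist = initial_idx, np.inf
    for cand in range(initial_idx - search_radius, initial_idx + search_radius + 1):
        start = cand - half
        stop = start + width
        # Skip candidates whose context window falls outside the signal.
        if start < 0 or stop > features.shape[0]:
            continue
        # Euclidean (Frobenius) distance is an illustrative choice of match score.
        dist = np.linalg.norm(features[start:stop] - template)
        if dist < best_dist:
            best_idx, best_dist = cand, dist
    return best_idx, float(best_dist)


# Toy example: a one-dimensional feature that steps from 0 to 1 at frame 50.
features = np.concatenate([np.zeros((50, 1)), np.ones((50, 1))])
# Hypothetical boundary model: two frames before and two frames after the step.
template = np.array([[0.0], [0.0], [1.0], [1.0]])

refined, dist = refine_boundary(features, initial_idx=46, template=template,
                                search_radius=8)
print(refined, dist)
```

In this toy case the first-pass estimate (frame 46) is off by four frames, and the search moves it to frame 50, where the local window matches the template exactly. In practice the features, the boundary model, and the match score would come from the chosen analysis method rather than from the placeholders used here.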