Effectiveness of PLP-based phonetic segmentation for speech synthesis

This paper investigates a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for the task of speech data labeling. Within the STM framework, we propose the use of several spectral features for phonetic segmentation: the recently proposed cochlear filter cepstral coefficients (CFCC), perceptual linear prediction cepstral coefficients (PLPCC), and RelAtive SpecTrAl (RASTA)-based PLPCC, in addition to Mel frequency cepstral coefficients (MFCC). Evaluating these segmentation algorithms directly requires accurate, manually labeled phoneme-level data, which is not available for low-resource languages such as Gujarati (one of the official languages of India). To measure the effectiveness of the segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has therefore been built. Subjective and objective evaluations show that the Viterbi-based algorithm and the STM algorithm with PLPCC features perform better than the other algorithms.
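As an illustration of the STM idea, the sketch below computes a spectral transition measure from a sequence of cepstral feature vectors and picks its local maxima as candidate phone boundaries. It follows the common regression-slope formulation of STM; the abstract does not give the exact formulation, window length, or peak-picking rule used in this work, so the function names, the `half_window` and `threshold` parameters, and the edge handling are all assumptions made for illustration. Any frame-level features (MFCC, PLPCC, CFCC) could be passed in.

```python
def spectral_transition_measure(features, half_window=2):
    """Return one STM value per frame.

    features: list of frames, each a list of cepstral coefficients.
    half_window: frames on each side used for the regression slope
                 (assumed value, not taken from the paper).
    """
    num_frames = len(features)
    num_coeffs = len(features[0])
    offsets = range(-half_window, half_window + 1)
    denom = sum(n * n for n in offsets)
    stm = []
    for m in range(num_frames):
        total = 0.0
        for i in range(num_coeffs):
            slope = 0.0
            for n in offsets:
                # Repeat the first/last frame at the edges (assumption).
                idx = min(max(m + n, 0), num_frames - 1)
                slope += n * features[idx][i]
            slope /= denom
            total += slope * slope
        # Mean squared regression slope across coefficients.
        stm.append(total / num_coeffs)
    return stm


def pick_boundaries(stm, threshold=0.0):
    """Return frame indices where STM has a local maximum above threshold."""
    return [
        m
        for m in range(1, len(stm) - 1)
        if stm[m] > stm[m - 1] and stm[m] >= stm[m + 1] and stm[m] > threshold
    ]


# Toy input: 10 frames of one "phone" followed by 10 frames of another.
toy_features = [[0.0, 0.0, 0.0]] * 10 + [[1.0, 1.0, 1.0]] * 10
toy_stm = spectral_transition_measure(toy_features)
toy_boundaries = pick_boundaries(toy_stm)
```

On real speech the STM contour is noisy, so a non-zero `threshold` or some smoothing would normally be needed before peak picking.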
