Tree-based context clustering using speech recognition features for acoustic model training of speech synthesis

Tree-based context clustering reduces the size of the acoustic models in Hidden Markov Model (HMM) speech synthesis systems and eliminates problems arising from unseen sound units. Speech units in synthesis systems are often represented by LPC or MCEP features, whose characteristics favor speech reconstruction rather than discrimination among different sound units. In this paper, MFCC features, which have been used successfully in speech recognition, were selected as the features for generating the context clustering trees of LPC/MCEP-based speech synthesis. On average, the collective size of the acoustic models was 29% smaller than that of the typical configuration, while the spectral features generated by a synthesis system using either type of clustering tree did not significantly deviate from features extracted from actual spoken utterances. Applying the MFCC-based clustering tree did not significantly affect the system's resulting pitch and duration models. We concluded that an MFCC-based clustering tree can reduce the overall size of the acoustic models while maintaining synthetic speech quality.
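The tree-based context clustering described above can be sketched as a greedy top-down search over phonetic-context questions, splitting a node whenever the split increases the data log-likelihood by more than a threshold. The following is a minimal NumPy illustration of that split criterion (in the spirit of tree-based state tying); the single diagonal-Gaussian node model, the function names, the toy question set, and the `min_gain` threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def node_loglik(frames):
    """Log-likelihood of pooled feature frames under one ML-fitted
    diagonal-covariance Gaussian (the usual node model in state tying)."""
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-6  # variance floor keeps the log finite
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(states, contexts, question):
    """Likelihood gain of splitting `contexts` by a yes/no context question."""
    yes = [c for c in contexts if question(c)]
    no = [c for c in contexts if not question(c)]
    if not yes or not no:          # degenerate split: reject
        return -np.inf, yes, no
    pooled = np.vstack([states[c] for c in contexts])
    gain = (node_loglik(np.vstack([states[c] for c in yes]))
            + node_loglik(np.vstack([states[c] for c in no]))
            - node_loglik(pooled))
    return gain, yes, no

def build_tree(states, questions, contexts=None, min_gain=50.0):
    """Greedy top-down clustering; returns the leaf partition of contexts."""
    if contexts is None:
        contexts = sorted(states)
    gain, yes, no = max(
        (split_gain(states, contexts, q) for q in questions.values()),
        key=lambda t: t[0])
    if gain < min_gain:            # stopping criterion: leaf node
        return [contexts]
    return (build_tree(states, questions, yes, min_gain)
            + build_tree(states, questions, no, min_gain))

# Toy usage: two groups of contexts with clearly different feature means;
# the first chosen split should separate the low-mean from the high-mean group.
rng = np.random.default_rng(0)
states = {'a': rng.normal(0, 1, (50, 3)), 'b': rng.normal(0, 1, (50, 3)),
          'c': rng.normal(5, 1, (50, 3)), 'd': rng.normal(5, 1, (50, 3))}
questions = {'left-is-vowel': lambda c: c in ('a', 'b'),
             'is-a': lambda c: c == 'a',
             'is-c': lambda c: c == 'c'}
leaves = build_tree(states, questions)
```

In the paper's setting, the `states[...]` arrays would hold MFCC frames (rather than the LPC/MCEP frames used for waveform generation), so the tree groups contexts by recognition-oriented discriminability while the leaf models themselves still emit the synthesis features.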
