On the (Un)importance of the Contextual Factors in HMM-Based Speech Synthesis and Coding

This paper evaluates the contextual factors used in HMM-based speech synthesis and coding systems. Two experimental setups are proposed, both based on successive addition of context, from phonetic context up to full context. The aim is to investigate the impact of the individual contextual factors on speech quality, and thereby to identify which factors are important and which are unimportant (also called weak), i.e., have no significant impact on speech quality. The results imply that in speech coding an improvement in quality can be achieved by reconstructing the syllable contexts alone; the sentence and utterance contexts are unimportant on the decoder side and need not be dealt with there. Whereas speech coding did not require the wider context, in speech synthesis the current syllable and utterance contexts are more important than the others (previous and next word/phrase contexts).
