HMM-based Mandarin Singing Voice Synthesis Using Tailored Synthesis Units and Question Sets

Fluency and continuity are essential to synthesizing a high-quality singing voice. To synthesize a smooth and continuous singing voice, this study employs the Hidden Markov Model (HMM)-based synthesis approach to construct a Mandarin singing voice synthesis system. The system is designed to generate Mandarin songs with arbitrary lyrics and melody within a certain pitch range. A singing voice database is designed and collected with the phonetic coverage of Mandarin singing voices in mind. Synthesis units and a question set are carefully defined and tailored to meet the minimum requirements for Mandarin singing voice synthesis. In addition, pitch-shift pseudo data extension and vibrato creation are applied to obtain more natural synthesized singing voices. The evaluation results show that the system, based on the tailored synthesis units and question set, improves the quality and intelligibility of the synthesized singing voice. Using pitch-shift pseudo data and vibrato creation further improves the quality and naturalness of the synthesized singing voices.
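The abstract names pitch-shift pseudo data extension and vibrato creation without detailing them. As a rough illustration only, the sketch below shows one common way such operations can be expressed on a frame-level F0 contour: shifting pitch by a number of semitones, and adding a sinusoidal vibrato. The function names, the 5 ms frame period, the 5.5 Hz rate, and the 50-cent depth are all assumptions made for this example and are not taken from the paper, whose actual procedure may differ.

```python
import math


def shift_semitones(f0_hz, semitones):
    """Shift an F0 contour (Hz per frame) by a number of semitones.

    Frames with F0 <= 0 are treated as unvoiced and left at 0.
    Hypothetical helper, not the paper's method.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


def add_vibrato(f0_hz, frame_period_s=0.005, rate_hz=5.5, depth_cents=50.0):
    """Add a sinusoidal vibrato to the voiced frames of an F0 contour.

    The modulation is applied in cents so that its depth is the same
    relative size at any pitch. All default values are assumed.
    """
    out = []
    for i, f in enumerate(f0_hz):
        if f <= 0:
            out.append(0.0)  # keep unvoiced frames unvoiced
            continue
        t = i * frame_period_s
        cents = depth_cents * math.sin(2.0 * math.pi * rate_hz * t)
        out.append(f * 2.0 ** (cents / 1200.0))
    return out


if __name__ == "__main__":
    contour = [440.0] * 100 + [0.0] * 10       # a held A4 followed by silence
    shifted = shift_semitones(contour, 2)       # pseudo data two semitones up
    with_vibrato = add_vibrato(contour)
    print(round(shifted[0], 2), round(max(with_vibrato), 2))
```

In a full system these contours would be passed to a vocoder together with the spectral parameters; that stage is outside the scope of this sketch.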
