Fluency and continuity are essential in singing voice synthesis. To synthesize smooth, continuous singing, we adopt the hidden Markov model (HMM)-based synthesis approach to build a Mandarin singing voice synthesis system. The system is designed to generate Mandarin songs with arbitrary lyrics and melodies within a given pitch range. We also build a singing voice database for training and synthesis, designed around the phonetic coverage of Mandarin speech. Acoustic features are extracted with the STRAIGHT algorithm so that the vocoded singing voice is of satisfactory quality. This paper describes the construction of the system, in particular the synthesis model and the question set defined for HMM-based singing voice synthesis. We also implement two techniques, pitch-shift pseudo-data extension and vibrato post-processing, to make the synthesized singing voice more natural.

The proposed framework consists of two phases: training and synthesis. In the training phase, excitation, spectral, and aperiodic factors are extracted from the singing voice database. The lyrics and notes of the songs in the corpus serve as contextual information for generating context-dependent label sequences. These sequences are clustered with a context-dependent question set, and context-dependent HMMs are then trained on the clustered phone segments. In the synthesis phase, the input musical score and lyrics are converted into a context-dependent label sequence. The parameter sequence for the given song, consisting of excitation, spectrum, and aperiodic factors, is constructed by concatenating the parameters generated from the context-dependent HMMs. Finally, the generated parameter sequences are passed through a Mel Log Spectrum Approximation (MLSA) filter to produce the singing voice.
This study improves model accuracy in three ways: defining a suitable question set, extending the singing voice database with pitch-shifted pseudo data, and adding vibrato through signal post-processing. The choice of question set is crucial for generating proper synthesis models. In the baseline system, the questions used most frequently in the F0 and mel-cepstral clustering trees concern sub-syllable type, note position, and phrase level. Because the recorded singing database is not large enough to contain every combination of contextual factors, only essential and suitable questions are defined, in contrast to the traditional method. In addition, the pitch-shifted pseudo data cover the pitch information that is missing for some sub-syllables and increase the size of the training data. Analysis of the recorded corpus, whose defined pitch range is C4 to B4, shows that shifting the frequency of a note too far changes its timbre. The missing pitch information is therefore compensated using nearby notes from other songs, whose signals are shifted to the target frequency in hertz given by a pitch-to-frequency mapping table. Vocal vibrato is a natural oscillation of musical pitch, and singers generally use it as an expressive and musically useful aspect of performance, so adding vibrato can make the synthesized singing voice more natural and expressive. Frequency and amplitude are the two fundamental parameters that shape the vibrato effect. Vibrato is created by varying a time delay periodically, following the principle of the Doppler effect. Our system implements this with a delay line and a low-frequency oscillator (LFO) that varies the delay. For evaluation, the singing voice signals were sampled at 48 kHz and windowed by a 25 ms Blackman window with a 5 ms shift.
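The pitch-to-frequency mapping table and the LFO-driven delay line described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names `pitch_table` and `add_vibrato` are hypothetical, the table assumes twelve-tone equal temperament with A4 = 440 Hz, and the vibrato rate (5.5 Hz), delay depth (1 ms), and linear interpolation are illustrative choices that the text does not specify. Only the 48 kHz sampling rate and the C4 to B4 range come from the description above.

```python
import numpy as np

A4_HZ = 440.0  # assumed tuning reference (not stated in the source)
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_table(octave=4):
    """Map the note names of one octave (e.g. C4..B4) to equal-tempered Hz."""
    table = {}
    for i, name in enumerate(NOTE_NAMES):
        midi = 12 * (octave + 1) + i  # MIDI note number, C4 = 60, A4 = 69
        table[f"{name}{octave}"] = A4_HZ * 2.0 ** ((midi - 69) / 12.0)
    return table


def add_vibrato(x, sr, rate_hz=5.5, depth_ms=1.0):
    """Read x through a delay line whose delay is varied by a sine LFO."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    depth = depth_ms * 1e-3 * sr  # peak delay deviation, in samples
    # The delay swings between 0 and 2*depth samples, so it is never negative.
    delay = depth * (1.0 + np.sin(2.0 * np.pi * rate_hz * n / sr))
    read_pos = n - delay  # fractional read position in the input
    # Linear interpolation between neighbouring samples; positions before the
    # start of the signal are read as silence.
    return np.interp(read_pos, n, x, left=0.0)


# Usage: a one-second A4 tone at the 48 kHz rate used in the evaluation.
sr = 48000
table = pitch_table(4)
t = np.arange(sr) / sr
tone = np.sin(2.0 * np.pi * table["A4"] * t)
wobbled = add_vibrato(tone, sr)
```

Because the output is the input read at a time-varying delay, the instantaneous pitch is scaled by one minus the rate of change of the delay. With the assumed values the peak deviation is about 2π × 5.5 Hz × 1 ms ≈ 3.5%, which is the Doppler-style pitch modulation the text refers to. Setting `depth_ms=0` returns the input unchanged.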
Mel-cepstral coefficients were then obtained from the STRAIGHT-extracted spectra. The feature vectors consist of spectrum, excitation, and aperiodic factor parameters. The spectrum parameter vectors consist of 49th-order STRAIGHT mel-cepstral coefficients, including the zeroth coefficient, together with their delta and delta-delta coefficients. The excitation parameter vectors consist of log F0, its

Proceedings of the Twenty-Fifth Conference on Computational Linguistics and Speech Processing (ROCLING 2013)