Speech-to-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices

This paper describes a speech-to-singing synthesis system that synthesizes a singing voice from two inputs: a speaking voice reading the lyrics of a song, and the song's musical score. The system is based on the speech manipulation system STRAIGHT and comprises three models, each controlling an acoustic feature unique to singing voices: the fundamental frequency (F0), phoneme duration, and spectrum. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four types of F0 fluctuation: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens each phoneme in the speaking voice according to the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of a singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results show that the proposed system can convert speaking voices into singing voices whose naturalness is close to that of actual singing voices.
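To make the idea of an F0 control model concrete, the following is a minimal sketch, not the paper's model. It builds a stepwise melody contour from a score and adds two of the four fluctuation types named above: vibrato (as a sinusoid) and fine fluctuation (as small Gaussian noise). Overshoot and preparation are omitted. The function names, the score format (MIDI note number and duration in seconds), and all parameter values (vibrato rate, depth, onset delay, jitter size, frame rate) are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def midi_to_hz(note):
    """Convert a MIDI note number to frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def f0_contour(notes, frame_rate=200, vib_rate_hz=5.5, vib_depth_cents=50.0,
               vib_onset_s=0.2, jitter_cents=5.0, seed=0):
    """Return a list of F0 values (Hz), one per frame.

    notes: list of (midi_note, duration_seconds) pairs (hypothetical format).
    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    contour = []
    for midi_note, dur in notes:
        base = midi_to_hz(midi_note)
        n_frames = int(round(dur * frame_rate))
        for i in range(n_frames):
            t = i / frame_rate  # time since the start of this note
            # Fine fluctuation: small random deviation, in cents.
            cents = rng.gauss(0.0, jitter_cents)
            # Vibrato: sinusoidal deviation that starts after an onset delay.
            if t >= vib_onset_s:
                phase = 2.0 * math.pi * vib_rate_hz * (t - vib_onset_s)
                cents += vib_depth_cents * math.sin(phase)
            # Apply the deviation in cents to the note's target frequency.
            contour.append(base * 2.0 ** (cents / 1200.0))
    return contour


# Two half-second notes (A4, then B4) at 200 frames per second.
contour = f0_contour([(69, 0.5), (71, 0.5)])
print(len(contour), round(min(contour), 1), round(max(contour), 1))
```

Working in cents rather than Hz keeps the vibrato depth perceptually uniform across notes of different pitch, which is why the deviation is applied as a multiplicative factor.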
