Expressive Control of Singing Voice Synthesis Using Musical Contexts and a Parametric F0 Model

Expressive singing voice synthesis requires appropriate control of both prosodic and timbral aspects. While it is desirable to have intuitive control over the expressive parameters, synthesis systems should be able to produce convincing results directly from a score. As countless interpretations of the same score are possible, the system should also target a particular singing style, which implies mimicking the various strategies used by different singers. Among the control parameters involved, the pitch (F0) should be modeled as a priority. In previous work, a parametric F0 model with intuitive controls was proposed, but no automatic way to choose the model parameters was given. In the present work, we propose a new approach to modeling singing style, based on the selection of parametric templates. In this approach, the F0 parameters and phoneme durations are extracted from annotated recordings, along with a rich description of contextual information, and stored to form a database of parametric templates. This database is then used to build a model of the singing style using decision trees. At the synthesis stage, appropriate parameters are selected according to the target contexts. The results produced by this approach have been evaluated by means of a listening test.
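To illustrate the pipeline the abstract describes, here is a minimal sketch of context-based template selection with a decision tree. The contextual descriptors and F0-model parameters shown are illustrative assumptions, not the paper's actual feature set or implementation:

```python
# Sketch: a database of parametric templates keyed by musical context,
# queried with a decision tree at synthesis time. Feature names and the
# parameter layout are hypothetical placeholders.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical database: each row pairs a contextual description of a note
# (e.g. MIDI pitch, note duration, position in phrase, vowel class) with
# F0-model parameters extracted from annotated recordings.
contexts = np.array([
    [60, 0.50, 0.0, 1],   # pitch, duration (s), phrase position, vowel id
    [64, 0.25, 0.5, 2],
    [67, 1.00, 1.0, 1],
])
templates = np.array([
    [0.8, 5.5, 0.12],     # e.g. vibrato depth, vibrato rate (Hz), attack time (s)
    [0.2, 6.0, 0.08],
    [1.1, 5.2, 0.20],
])

# Decision trees handle multi-output regression natively in scikit-learn,
# partitioning the context space into regions that share similar parameters.
tree = DecisionTreeRegressor(min_samples_leaf=1).fit(contexts, templates)

# Synthesis stage: select appropriate parameters for a target context.
target = np.array([[62, 0.40, 0.2, 1]])
print(tree.predict(target))  # predicted F0-model parameters for the note
```

In this toy setting the tree simply maps each unseen target context to the parameters of the most similar stored template; a real system would train on many annotated notes and predict phoneme durations in the same way.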
