Evaluation of Singing Synthesis: Methodology and Case Study with Concatenative and Performative Systems

The special session Singing Synthesis Challenge: Fill-In the Gap aims at the comparative evaluation of singing synthesis systems. The task is to synthesize a new couplet for two popular songs. This paper addresses the methodology needed for quality assessment of singing synthesis systems and reports on a case study using two systems with a total of six different configurations. The two synthesis systems are: a concatenative Text-to-Chant (TTC) system, including a parametric representation of the melodic curve, and a Singing Instrument (SI), allowing real-time interpretation of utterances made of flat-pitch natural voice or diphone-concatenated voice. Absolute Category Rating (ACR) and Paired Comparison (PC) tests are used. Natural and degraded-natural reference conditions are used to calibrate the ACR test. The MOS obtained with ACR shows that the TTC ranks below natural voice but above the degraded conditions, while the SI ranks below natural voice and in between the degraded conditions. Singing synthesis quality is therefore judged better than auto-tuned or distorted natural voice in some cases. The PC results show that: (1) signal processing is an important quality issue and makes the difference between systems; (2) diphone concatenation degrades quality compared to flat-pitch natural voice; (3) automatic melodic modelling is preferred to gestural control for off-line synthesis.
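As a minimal illustration of how such listening-test data are typically aggregated, the sketch below computes a Mean Opinion Score (MOS) per condition from ACR ratings and a preference rate from Paired Comparison trials. The condition names and numbers are hypothetical placeholders, not data from the study.

```python
import statistics

# Hypothetical ACR ratings on a 1-5 scale, one list per test condition.
# Names and values are illustrative only.
acr_ratings = {
    "natural":        [5, 5, 4, 5, 4],
    "natural_degraded": [3, 4, 3, 3, 4],
    "TTC":            [4, 3, 4, 4, 3],
    "SI_flat_pitch":  [3, 3, 4, 3, 3],
    "SI_diphone":     [2, 3, 3, 2, 3],
}

# MOS: per-condition mean of the ACR ratings.
mos = {cond: statistics.mean(r) for cond, r in acr_ratings.items()}

# Paired Comparison trials: (condition A, condition B, preferred condition).
pc_trials = [
    ("TTC", "SI_diphone", "TTC"),
    ("TTC", "SI_diphone", "TTC"),
    ("TTC", "SI_diphone", "SI_diphone"),
]

def preference_rate(trials, a, b):
    """Fraction of A-vs-B trials in which condition a was preferred."""
    relevant = [winner for x, y, winner in trials if {x, y} == {a, b}]
    return sum(w == a for w in relevant) / len(relevant)

for cond, score in sorted(mos.items(), key=lambda kv: -kv[1]):
    print(f"{cond:>18s}  MOS = {score:.2f}")
print("P(TTC preferred over SI_diphone) =",
      preference_rate(pc_trials, "TTC", "SI_diphone"))
```

In practice, such MOS values would be reported with confidence intervals and the PC preference rates tested for significance (e.g. with a binomial test), but the aggregation logic is as shown.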
