Implementations of synthesis models for speech and singing

The current implementations of the synthesis models for speech and singing are described. An improved model for speech is presented and compared to the model currently in use. A new singing synthesis model has recently been implemented on a signal-processing board. The differences between these models are pointed out. Test results from comparative measurements on synthetic speech are also presented. Future improvements of both speech and singing synthesis are discussed.

INTRODUCTION

In a previous paper (Neovius, 1989), a digital signal-processing (DSP) development environment for the Texas Instruments TMS 320C25 chip and its application to the real-time implementation of the OVE III model was presented. In this paper, the ongoing work with this model and its implementation is discussed, and comparative measurements of the speech models are presented. The OVE III model is part of a text-to-speech system using the RULSYS environment (Carlson & Granström, 1975). RULSYS is also used for synthesis of singing. The first version of the Music and Singing Synthesis Equipment (MUSSE) was built as an analog implementation of a synthesis model similar to the classical OVE II synthesizer (Larsson, 1977). Synthesis of VCV syllables using this model is described in Zera, Gauffin, & Sundberg (1984). The analog MUSSE synthesizer has been a very useful tool in our research on singing (Frydén, Sundberg, & Askenfelt, 1982; Sundberg, 1989), but a new, more versatile singing synthesizer was needed. For instance, the smallest pitch step in the analog MUSSE is 6.25 cents, whereas, for musical purposes, a much finer pitch quantization is required (1 cent or less). Therefore, the best solution was to implement the singing synthesis in a digital signal processor, similar to OVE III for speech. Now that the third generation of floating-point digital signal processors has arrived, the singing synthesis requirements can be met.
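As an aside, the cent values quoted above translate to frequency ratios by the standard musical convention (1200 cents per octave). The sketch below is not part of the original paper; it merely illustrates why a 6.25-cent step is audible while a 1-cent step is considered adequate. The example frequency of 440 Hz is an assumption for illustration.

```python
import math

def cents_to_ratio(cents: float) -> float:
    """Frequency ratio corresponding to a pitch interval in cents
    (1200 cents = one octave, so ratio = 2 ** (cents / 1200))."""
    return 2.0 ** (cents / 1200.0)

# Size in Hz of the smallest pitch step around a 440 Hz tone:
coarse = 440.0 * (cents_to_ratio(6.25) - 1.0)  # analog MUSSE resolution
fine = 440.0 * (cents_to_ratio(1.0) - 1.0)     # target resolution
```

For a 440 Hz tone the 6.25-cent step is roughly 1.6 Hz, more than six times the 1-cent step, which is why the finer quantization was required for musical purposes.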
A development environment for the Texas Instruments TMS 320C30 chip has now been created and applied to a real-time implementation of the new MUSSE DIG synthesizer.

SPEECH SYNTHESIS IMPLEMENTATION

A text-to-speech system has been developed over several years (Carlson, Granström, & Hunnicutt, 1990). It uses a version of the OVE III cascade formant filter synthesis model and a simplified glottal pulse source with only one control parameter, the fundamental frequency, F0 (Fig. 1). The model is implemented in a NEC 7720 digital signal processor. This has resulted in a product with an acceptable male voice quality and less acceptable female and child voices. Continuous work on speech analysis and on synthesis rules has improved the quality of the existing text-to-speech system, but the possibilities for further enhancement are somewhat limited by the built-in limitations of the synthesis model implementation. The NEC implementation of the real-time synthesizer demands a fixed set of parameters, thus making it difficult to test enhanced models. Several new features are included in the new implementation. One is the use of the LF glottal source model (Fant, Liljencrants, & Lin, 1985), developed at our department. Tests of this glottal source have shown that a more natural voice and more realistic female and child voices can be achieved (Carlson, Fant, Gobl, Granström, Karlsson, & Lin, 1989). Another important modification is the improved nasal branch (Carlson, Granström, & Nord, 1990). Work has also been done

*Names in alphabetic order.
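To make the cascade formant filter idea concrete, the following is a minimal sketch of the kind of structure such a synthesizer uses: a source signal passed through second-order digital resonators in series, one per formant. This is a generic textbook-style resonator (Klatt-type coefficients), not the actual OVE III or NEC 7720 code; the formant frequencies, bandwidths, and 16 kHz sample rate are illustrative assumptions.

```python
import math

def resonator_coeffs(f_hz, bw_hz, fs):
    """Coefficients of a second-order digital resonator with centre
    frequency f_hz and bandwidth bw_hz at sample rate fs, normalized
    to unity gain at DC (A = 1 - B - C)."""
    c = -math.exp(-2.0 * math.pi * bw_hz / fs)
    b = 2.0 * math.exp(-math.pi * bw_hz / fs) * math.cos(2.0 * math.pi * f_hz / fs)
    a = 1.0 - b - c
    return a, b, c

def cascade_formants(source, formants, fs):
    """Run the source samples through one resonator per (freq, bandwidth)
    pair in series, as in a cascade formant synthesizer."""
    y = list(source)
    for f_hz, bw_hz in formants:
        a, b, c = resonator_coeffs(f_hz, bw_hz, fs)
        y1 = y2 = 0.0  # filter memory: y[n-1], y[n-2]
        out = []
        for x in y:
            v = a * x + b * y1 + c * y2  # y[n] = A x[n] + B y[n-1] + C y[n-2]
            out.append(v)
            y2, y1 = y1, v
        y = out
    return y

# Hypothetical usage: shape an impulse with two formants at 16 kHz.
response = cascade_formants([1.0] + [0.0] * 2000, [(500.0, 60.0), (1500.0, 90.0)], 16000.0)
```

A cascade (series) structure of this kind fixes the relative formant amplitudes automatically, which is one reason it is attractive for rule-driven synthesis; the price, as the text notes, is that changing the model (e.g. adding a nasal branch or a new source) requires reworking the implementation itself.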