An HMM-based singing voice synthesis system

Abstract

The present paper describes a corpus-based singing voice synthesis system based on hidden Markov models (HMMs). This system employs HMM-based speech synthesis to synthesize singing voice. Musical information such as lyrics, tones, and durations is modeled simultaneously in a unified framework of context-dependent HMMs. The system can mimic the voice quality and singing style of the original singer. Results of a singing voice synthesis experiment show that the proposed system can synthesize smooth and natural-sounding singing voice.

Index Terms: singing voice synthesis, HMM, time-lag model.

1. Introduction

In recent years, various applications of speech synthesis systems have been proposed and investigated. Singing voice synthesis is one of the active topics in this area [1–5]. However, only a few corpus-based singing voice synthesis systems that can be constructed automatically have been proposed.

Currently, there are two main paradigms in corpus-based speech synthesis: the sample-based approach and the statistical approach. The sample-based approach, such as unit selection [6], can synthesize high-quality speech. However, it requires a huge amount of training data to realize various voice characteristics. On the other hand, speech synthesized by the statistical approach, such as HMM-based speech synthesis [7], sounds buzzy because it is based on a vocoding technique. However, it is smooth and stable, and its voice characteristics can easily be modified by transforming the HMM parameters appropriately. For singing voice synthesis, applying unit selection seems to be difficult because a huge amount of singing speech covering the vast combinations of contextual factors that affect singing voice has to be recorded. On the other hand, the HMM-based system can be constructed using a relatively small amount of training data. From this point of view, the HMM-based approach seems to be more suitable for a singing voice synthesizer.
In the present paper, we apply the HMM-based synthesis approach to singing voice synthesis.

Although the singing voice synthesis system proposed in the present paper is quite similar to the HMM-based text-to-speech synthesis system [7], there are two main differences between them. In the HMM-based text-to-speech synthesis system, contextual factors that may affect read speech (e.g., phonemes, syllables, words, and phrases) are taken into account. However, contextual factors which may affect singing voice should be different

[1]  Alan W. Black, et al. Unit selection in a concatenative speech synthesis system using a large speech database, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[2]  Julius O. Smith, et al. Toward a high-quality singing synthesizer with vocal texture control, 2002.

[3]  Keiichi Tokuda, et al. Speech parameter generation algorithms for HMM-based speech synthesis, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Keiichi Tokuda, et al. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, 1999, EUROSPEECH.

[5]  Mark A. Clements, et al. Concatenation-Based MIDI-to-Singing Voice Synthesis, 1997.

[6]  J. J. Odell, et al. The Use of Context in Large Vocabulary Speech Recognition, 1995.

[7]  Satoshi Imai, et al. Cepstral analysis synthesis on the mel frequency scale, 1983, ICASSP.

[8]  Keiichi Tokuda, et al. An adaptive algorithm for mel-cepstral analysis of speech, 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9]  Mark A. Clements, et al. A singing voice synthesis system based on sinusoidal modeling, 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.