Abstract

The present paper describes a corpus-based singing voice synthesis system based on hidden Markov models (HMMs). This system employs HMM-based speech synthesis to synthesize singing voices. Musical information such as lyrics, tones, and durations is modeled simultaneously in a unified framework of context-dependent HMMs. The system can mimic the voice quality and singing style of the original singer. Results of a singing voice synthesis experiment show that the proposed system can synthesize smooth and natural-sounding singing voices.

Index Terms: singing voice synthesis, HMM, time-lag model

1. Introduction

In recent years, various applications of speech synthesis systems have been proposed and investigated. Singing voice synthesis is one of the hot topics in this area [1–5]. However, only a few corpus-based singing voice synthesis systems that can be constructed automatically have been proposed.

Currently, there are two main paradigms in the corpus-based speech synthesis area: the sample-based approach and the statistical approach. A sample-based approach such as unit selection [6] can synthesize high-quality speech, but it requires a huge amount of training data to realize various voice characteristics. On the other hand, the quality of a statistical approach such as HMM-based speech synthesis [7] is buzzy because it is based on a vocoding technique; however, it is smooth and stable, and its voice characteristics can easily be modified by transforming the HMM parameters appropriately. For singing voice synthesis, applying unit selection seems to be difficult because a huge amount of singing speech, covering the vast combinations of contextual factors that affect the singing voice, would have to be recorded. The HMM-based system, in contrast, can be constructed from a relatively small amount of training data. From this point of view, the HMM-based approach seems more suitable for a singing voice synthesizer. In the present paper, we apply the HMM-based synthesis approach to singing voice synthesis.

Although the singing voice synthesis system proposed in the present paper is quite similar to the HMM-based text-to-speech synthesis system [7], there are two main differences between them. In the HMM-based text-to-speech synthesis system, contextual factors which may affect reading speech (e.g., phonemes, syllables, words, phrases, etc.) are taken into account. However, the contextual factors which may affect the singing voice should be different from those for reading speech.
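To make the unified context-dependent modeling concrete, the sketch below illustrates how lyrics, note pitches, and note durations from a musical score could be flattened into per-phoneme full-context labels of the kind used to select context-dependent HMMs. The label format, the ScoreNote fields, and the make_labels helper are illustrative assumptions for this sketch, not the label set of the actual system.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ScoreNote:
        """One note of the score: the lyric syllable, its phonemes,
        the MIDI note number, and the note length in 10-ms frames."""
        syllable: str
        phonemes: List[str]
        midi_pitch: int
        frames: int

    def make_labels(score: List[ScoreNote]) -> List[str]:
        """Flatten a score into per-phoneme full-context labels.

        Each label combines phonetic context (previous/current/next
        phoneme) with musical context (note pitch and duration), so a
        single context-dependent HMM can model them simultaneously.
        The format here is a simplified assumption for illustration.
        """
        # Phoneme sequence with a silence sentinel at both ends.
        phones = ["sil"] + [p for n in score for p in n.phonemes] + ["sil"]
        # Musical context for every phoneme (sentinels get neutral values).
        contexts = [(0, 0)] + [(n.midi_pitch, n.frames)
                               for n in score for _ in n.phonemes] + [(0, 0)]
        labels = []
        for i in range(1, len(phones) - 1):
            pitch, frames = contexts[i]
            labels.append(f"{phones[i-1]}-{phones[i]}+{phones[i+1]}"
                          f"/pitch:{pitch}/dur:{frames}")
        return labels

    # Example: two notes of a melody sung on the syllables "sa" and "ku".
    score = [ScoreNote("sa", ["s", "a"], midi_pitch=67, frames=50),
             ScoreNote("ku", ["k", "u"], midi_pitch=69, frames=100)]
    for label in make_labels(score):
        print(label)

In HMM-based synthesis, labels of this kind typically index decision-tree-clustered context-dependent models, so combinations of lyric, pitch, and duration contexts unseen in the training data can still be synthesized.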
References

[1] Alan W. Black, et al., "Unit selection in a concatenative speech synthesis system using a large speech database," in Proc. ICASSP, 1996.
[2] Julius O. Smith, et al., "Toward a high-quality singing synthesizer with vocal texture control," 2002.
[3] Keiichi Tokuda, et al., "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, 2000.
[4] Keiichi Tokuda, et al., "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Proc. EUROSPEECH, 1999.
[5] Mark A. Clements, et al., "Concatenation-Based MIDI-to-Singing Voice Synthesis," 1997.
[6] J. J. Odell, et al., "The Use of Context in Large Vocabulary Speech Recognition," 1995.
[7] Satoshi Imai, et al., "Cepstral analysis synthesis on the mel frequency scale," in Proc. ICASSP, 1983.
[8] Keiichi Tokuda, et al., "An adaptive algorithm for mel-cepstral analysis of speech," in Proc. ICASSP, 1992.
[9] Mark A. Clements, et al., "A singing voice synthesis system based on sinusoidal modeling," in Proc. ICASSP, 1997.