Glottal source modeling for voice conversion

Abstract: This paper describes recent advances in glottal source modeling for speech synthesis. In particular, two procedures for modeling the glottal excitation waveform are described and applied to voice conversion. The first model uses a polynomial to represent the glottal excitation waveform for one pitch period. The coefficients of the polynomial form a vector that is used to design a glottal excitation codebook with 32 entries for voiced excitation. The codebook is designed and trained using two sentences spoken by different speakers. Speech is synthesized using a quantized glottal excitation waveform from one speaker as the excitation for a glottal excitation linear predictive (GELP) synthesizer designed using vocal tract parameters obtained from the speech of another speaker. Our implementation of the LP synthesizer is patterned after both a pitch-excited LP speech synthesizer and a code-excited linear predictive (CELP) speech coder. In addition to the glottal excitation codebook, we use a stochastic codebook with 256 entries for unvoiced noise excitation. Analysis techniques are described for constructing both codebooks. The GELP synthesizer, which resynthesizes speech with high quality, gives the speech scientist a simple synthesis procedure that uses established analysis techniques, can reproduce all speech sounds, and has an excitation waveform model that is related to the derivative of the glottal flow and the integral of the residue. The second approach uses the LF glottal volume-velocity waveform to model the characteristics of three voice types: modal, breathy, and vocal fry (creaky). We then convert a modal voice to sound like a breathy or vocal fry voice by using the vocal tract characteristics of the modal voice together with the glottal volume-velocity waveform model for the breathy or vocal fry voice as the excitation.
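The polynomial codebook idea in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the polynomial order (6), the use of plain k-means to train the codebook, the direct-form all-pole filter standing in for the LP synthesizer, and all function names. Only the codebook size of 32 comes from the abstract.

```python
import numpy as np

def fit_period(period, order=6):
    """Least-squares polynomial fit to one pitch period.

    Time is normalized to [0, 1] so periods of different length
    give comparable coefficient vectors. The order is an assumption.
    """
    t = np.linspace(0.0, 1.0, len(period))
    return np.polyfit(t, period, order)  # order + 1 coefficients

def train_codebook(vectors, size=32, iters=20, seed=0):
    """Train a vector-quantization codebook with plain k-means
    (an illustrative choice, not necessarily the paper's method)."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    start = rng.choice(len(vectors), size, replace=False)
    codebook = vectors[start].copy()
    for _ in range(iters):
        # Squared distance from every vector to every codebook entry.
        dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = dist.argmin(axis=1)
        for k in range(size):
            members = vectors[nearest == k]
            if len(members):  # leave empty cells unchanged
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(coeffs, codebook):
    """Index of the codebook entry closest to a coefficient vector."""
    return int(((codebook - coeffs) ** 2).sum(axis=1).argmin())

def excitation_from_entry(entry, length):
    """Evaluate a codebook polynomial over one pitch period."""
    return np.polyval(entry, np.linspace(0.0, 1.0, length))

def all_pole(excitation, a):
    """All-pole filter: y[n] = x[n] - sum_k a[k] * y[n - k].

    `a` holds the predictor coefficients a[1..p] (leading 1 omitted).
    """
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y

def toy_periods(n=200, seed=1):
    """Synthetic stand-ins for measured excitation periods (not real data)."""
    rng = np.random.default_rng(seed)
    periods = []
    for _ in range(n):
        t = np.linspace(0.0, 1.0, int(rng.integers(60, 120)))
        periods.append(-np.sin(np.pi * t) ** rng.uniform(1.0, 4.0))
    return periods
```

In this sketch, voice conversion in the sense of the abstract would amount to quantizing one speaker's pitch periods with `quantize`, rebuilding each period with `excitation_from_entry`, and passing the result through `all_pole` with predictor coefficients estimated from a different speaker.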
