Joint estimation of glottal source and vocal tract for vocal synthesis using Kalman smoothing and EM algorithm

This paper presents a joint estimation of the parameters of the derivative glottal source waveform and the vocal tract filter, using a state-space model that accounts for both aspiration noise and observation noise. Voice production is modeled by the Rosenberg-Klatt glottal model in conjunction with an all-pole filter. The EM algorithm iteratively estimates the model parameters in a maximum-likelihood sense, with a Kalman smoother used in the expectation step. The model and estimator yield improved parameter estimates for resynthesis, producing output that sounds natural and remains flexible for modification, a desirable property for expressive vocal synthesis.
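To make the signal model concrete, the following is a minimal Python sketch of the generative side only: a Rosenberg-Klatt style derivative glottal pulse, with aspiration noise added at the source, is passed through an all-pole filter, and observation noise is added at the output. It does not implement the EM estimator or the Kalman smoother. The function names, the amplitude scaling (peak flow equal to `amplitude`), the default noise levels, and the filter coefficients in the usage note are illustrative assumptions, not values taken from the paper.

```python
import random


def rk_derivative_pulse(period, open_quotient=0.6, amplitude=1.0):
    """One period of a Rosenberg-Klatt style derivative glottal waveform.

    During the open phase the flow is g(n) = a*n**2 - b*n**3, so its
    derivative is 2*a*n - 3*b*n**2. The closed phase is zero.
    """
    n_open = int(round(open_quotient * period))
    # Assumed scaling: a and b are chosen so that the flow returns to zero
    # at n = n_open (a / b = n_open) and peaks at `amplitude`.
    a = 27.0 * amplitude / (4.0 * n_open ** 2)
    b = a / n_open
    pulse = [2.0 * a * n - 3.0 * b * n ** 2 for n in range(n_open)]
    return pulse + [0.0] * (period - n_open)


def all_pole_filter(excitation, ar_coeffs):
    """All-pole filter: y[n] = x[n] + sum_k ar_coeffs[k] * y[n-1-k]."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, c in enumerate(ar_coeffs):
            if n - 1 - k >= 0:
                acc += c * out[n - 1 - k]
        out.append(acc)
    return out


def synthesize(n_periods, period, ar_coeffs,
               aspiration_std=0.01, obs_std=0.001, seed=0):
    """Noisy source -> all-pole vocal tract -> noisy observation."""
    rng = random.Random(seed)
    source = rk_derivative_pulse(period) * n_periods
    # Aspiration noise enters at the glottal source (process noise).
    excitation = [s + rng.gauss(0.0, aspiration_std) for s in source]
    clean = all_pole_filter(excitation, ar_coeffs)
    # Observation noise is added to the filter output.
    return [c + rng.gauss(0.0, obs_std) for c in clean]
```

For example, `synthesize(5, 80, [1.3, -0.8])` produces five pitch periods of 80 samples through a hypothetical stable two-pole filter. In the estimation problem the abstract describes, the pulse parameters, the filter coefficients, and the two noise variances would be the unknowns fitted to a recorded signal.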
