Statistical estimation of articulatory trajectories from the speech signal using dynamical and phonological constraints
In speech science and technology, the acoustic-to-articulatory mapping is known to be a difficult problem because of its nonlinear and one-to-many characteristics. Over the years, different optimization techniques have been proposed to solve this problem. One of these methods is based on extended Kalman filtering and smoothing. Although the application of this technique to vowels was promising, its extension to all classes of speech sounds has not been successful. This thesis focuses on developing and improving a statistical method, based on extended Kalman filtering and smoothing, for estimating articulatory trajectories from the speech signal.
In this study, we propose a new way of constraining the acoustic-to-articulatory inversion by imposing high-level phonological constraints in addition to the dynamical ones. These phonological constraints are imposed by constructing a separate dynamical model, with its own acoustic observation function, for each coproduction unit of speech, where a unit consists of two consecutive phones. Each observation sub-function is approximated in small regions by piecewise linear functions using articulatory-acoustic look-up tables. The model parameters are estimated by a direct maximum-likelihood method using training articulatory-acoustic trajectories. An integrated method is proposed for recognizing coproduction units and segmenting the speech signal, based on the maximum likelihood of the acoustic observations given the different coproduction models. The likelihood of the acoustic observations given each phonological coproduction model is computed from the innovation sequences of the extended Kalman filter. The smoothed articulatory states of the model with the highest likelihood are used as the best estimate of the articulatory trajectories in each segment. Good estimation results for all classes of speech sounds were obtained in experiments using both synthesized and real human data.
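The model-selection step described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes a scalar articulatory state, target-directed linear dynamics, and two toy observation functions with analytic derivatives (the thesis uses piecewise linear approximations from look-up tables). The names `UnitModel`, `ekf_log_likelihood`, `select_unit`, `simulate`, `unit_a`, and `unit_b`, and all parameter values, are hypothetical. The sketch returns filtered states only; the smoothing pass and the segmentation of the signal are omitted.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class UnitModel:
    """Hypothetical scalar model for one coproduction unit (illustration only)."""
    name: str
    target: float                     # assumed articulatory target of the unit
    h: Callable[[float], float]       # acoustic observation function
    h_jac: Callable[[float], float]   # derivative of h, used to linearize
    a: float = 0.5                    # state transition coefficient
    q: float = 0.01                   # process noise variance
    r: float = 0.01                   # observation noise variance


def ekf_log_likelihood(model: UnitModel, ys: Sequence[float],
                       x0: float = 0.0, p0: float = 0.01) -> Tuple[float, List[float]]:
    """Run a scalar extended Kalman filter and accumulate the Gaussian
    log-likelihood of the observations from the innovation sequence."""
    x, p = x0, p0
    loglik = 0.0
    states: List[float] = []
    for y in ys:
        # Predict: the state moves toward the unit's target.
        x_pred = model.a * x + (1.0 - model.a) * model.target
        p_pred = model.a * p * model.a + model.q
        # Innovation and its variance, linearizing h at the prediction.
        jac = model.h_jac(x_pred)
        e = y - model.h(x_pred)
        s = jac * p_pred * jac + model.r
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
        # Update the state estimate and its variance.
        k = p_pred * jac / s
        x = x_pred + k * e
        p = (1.0 - k * jac) * p_pred
        states.append(x)
    return loglik, states


def select_unit(models: Sequence[UnitModel],
                ys: Sequence[float]) -> Tuple[str, float, List[float]]:
    """Return (name, log-likelihood, filtered states) of the best-scoring model."""
    results = [(m.name, *ekf_log_likelihood(m, ys)) for m in models]
    return max(results, key=lambda item: item[1])


def simulate(model: UnitModel, n: int, x0: float = 0.0) -> List[float]:
    """Generate noise-free observations from a model's own dynamics (toy data)."""
    x, ys = x0, []
    for _ in range(n):
        x = model.a * x + (1.0 - model.a) * model.target
        ys.append(model.h(x))
    return ys


# Two toy units with different targets and observation functions (assumed values).
unit_a = UnitModel("unit_a", 1.0, lambda x: x + 0.2 * x * x, lambda x: 1.0 + 0.4 * x)
unit_b = UnitModel("unit_b", -1.0, lambda x: x - 0.2 * x * x, lambda x: 1.0 - 0.4 * x)

if __name__ == "__main__":
    observations = simulate(unit_a, 20)
    best_name, best_loglik, _ = select_unit([unit_a, unit_b], observations)
    print(best_name, best_loglik)
```

In this toy setting the mismatched model produces large innovations relative to their predicted variance, which lowers its log-likelihood, so the generating unit is selected. A multivariate version would replace the scalar products with matrix operations and the logarithm of `s` with the log-determinant of the innovation covariance.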