Gaussian process dynamical models for nonparametric speech representation and synthesis

We propose Gaussian process dynamical models (GPDMs) as a new, nonparametric paradigm in acoustic models of speech. These use multidimensional, continuous state-spaces to overcome familiar issues with discrete-state, HMM-based speech models. The added dimensions allow the state to represent and describe more than just temporal structure as systematic differences in mean, rather than as mere correlations in a residual (which dynamic features or AR-HMMs do). Being based on Gaussian processes, the models avoid restrictive parametric or linearity assumptions on signal structure. We outline GPDM theory, and describe model setup and initialization schemes relevant to speech applications. Experiments demonstrate subjectively better quality of synthesized speech than from comparable HMMs. In addition, there is evidence for unsupervised discovery of salient speech structure.

[1]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[2]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[3]  Heiga Zen,et al.  Hidden semi-Markov model based speech synthesis , 2004, INTERSPEECH.

[4]  David J. Fleet,et al.  Gaussian Process Dynamical Models , 2005, NIPS.

[5]  Hideki Kawahara,et al.  STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds , 2006 .

[6]  D. Lawrence The Gaussian Process Latent Variable Model , 2006 .

[7]  Heiga Zen,et al.  The HMM-based speech synthesis system (HTS) version 2.0 , 2007, SSW.

[8]  Heiga Zen,et al.  Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences , 2007, Comput. Speech Lang..

[9]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[10]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[11]  Salil Deena,et al.  Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model , 2010, ICMI-MLMI '10.

[12]  Gustav Eje Henter,et al.  Intermediate-State HMMs to Capture Continuously-Changing Signal Features , 2011, INTERSPEECH.

[13]  Heiga Zen,et al.  The Effect of Using Normalized Models in Statistical Speech Synthesis , 2011, INTERSPEECH.

[14]  Hyunsin Park Gaussian Process Dynamical Models for Phoneme Classification , 2011 .