Speaker adaptation of trajectory HMMs using feature-space MLLR

Abstract

Recently, a trajectory model, derived from the hidden Markov model (HMM) by imposing explicit relationships between static and dynamic features, has been proposed. The derived model, named the trajectory HMM, can alleviate two limitations of the HMM: constant statistics within a state and the conditional independence assumption of state output probabilities. In the present paper, a speaker adaptation algorithm for the trajectory HMM based on feature-space Maximum Likelihood Linear Regression (fMLLR) is derived and evaluated. Results of a simple continuous speech recognition experiment show that adapting trajectory HMMs using the derived adaptation algorithm improves the speech recognition performance.

Index Terms: trajectory HMM, adaptation, fMLLR.

1. Introduction

Speech recognition technologies have achieved significant progress with the introduction of hidden Markov models (HMMs). Their tractability and efficient implementations are achieved by a number of assumptions, such as constant statistics within an HMM state and conditional independence of state output probabilities. Although these assumptions make the HMM practically useful, they are not realistic for modeling sequences of speech spectra, especially in spontaneous speech. To overcome these shortcomings of the HMM, a variety of alternative models have been proposed, e.g., [1–3]. Although these models can improve the speech recognition performance, they generally require an increase in the number of model parameters and computational complexity. Alternatively, the use of dynamic features (delta and delta-delta features) [4] also improves the performance of HMM-based speech recognizers. It can be viewed as a simple mechanism to capture time dependencies. However, it has been thought of as an ad hoc rather than an essential solution. Generally, dynamic features are calculated as regression coefficients from their neighboring static features. Therefore, relationships between static and dynamic feature vector sequences are deterministic.
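To make the deterministic relationship concrete, the following is a minimal sketch of regression-based dynamic features for a one-dimensional static sequence. The window width, the replication of edge frames, and the choice to compute delta-delta by applying the same regression to the delta sequence are assumptions made for illustration; the function names `delta` and `append_dynamic` are hypothetical and do not come from the paper.

```python
def delta(static, width=1):
    """Regression-based delta of a 1-D static feature sequence.

    Assumptions (not from the paper): a symmetric window of +/- `width`
    frames, with the first and last frames replicated at the boundaries.
    """
    T = len(static)
    # Normalizer of the regression slope: sum of tau^2 over the window.
    denom = 2 * sum(tau * tau for tau in range(1, width + 1))
    out = []
    for t in range(T):
        num = 0.0
        for tau in range(1, width + 1):
            ahead = static[min(t + tau, T - 1)]   # clamp at the last frame
            behind = static[max(t - tau, 0)]      # clamp at the first frame
            num += tau * (ahead - behind)
        out.append(num / denom)
    return out


def append_dynamic(static, width=1):
    """Return per-frame [static, delta, delta-delta] observations.

    Assumption: delta-delta is the same regression applied to the deltas.
    """
    d1 = delta(static, width)
    d2 = delta(d1, width)
    return [[c, a, b] for c, a, b in zip(static, d1, d2)]


if __name__ == "__main__":
    c = [0.0, 1.0, 4.0, 9.0, 16.0]
    for frame in append_dynamic(c):
        print(frame)
```

Every dynamic value here is a fixed linear combination of neighboring static values, so once the static sequence is given, the full observation sequence is fully determined.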
However, these relationships are usually ignored, and the static and dynamic features are modeled as independent random variables. Ignoring these dependencies allows inconsistency between the static and dynamic features when the HMM is used as a generative model in the obvious way.

Recently, a trajectory model, derived from the HMM by imposing the explicit relationships between static and dynamic features, has been proposed [5]. The derived model, named

[1] Chin-Hui Lee, et al. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, 1994, IEEE Trans. Speech Audio Process.

[2] Heiga Zen, et al. Reformulating the HMM as a Trajectory Model, 2004.

[3] Mark J. F. Gales, et al. Maximum likelihood linear transformations for HMM-based speech recognition, 1998, Comput. Speech Lang.

[4] Heiga Zen, et al. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences, 2007, Comput. Speech Lang.

[5] 全 炳河, et al. Reformulating HMM as a trajectory model by imposing explicit relationships between static and dynamic features, 2006.

[6] G. Zweig, et al. Speech recognition using dynamic Bayesian networks, 1998.

[7] Sadaoki Furui, et al. Speaker-independent isolated word recognition using dynamic features of speech spectrum, 1986, IEEE Trans. Acoust. Speech Signal Process.

[8] Junichi Yamagishi, et al. Average-Voice-Based Speech Synthesis, 2006.

[9] Keiichi Tokuda, et al. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, 1999, EUROSPEECH.

[10] Heiga Zen, et al. Estimating Trajectory HMM Parameters Using Monte Carlo EM with Gibbs Sampler, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[11] Roland Kuhn, et al. Rapid speaker adaptation in eigenvoice space, 2000, IEEE Trans. Speech Audio Process.

[12] Mark J. F. Gales, et al. Switching linear dynamical systems for speech recognition, 2003.

[13] Mari Ostendorf, et al. From HMM's to segment models: a unified view of stochastic modeling for speech recognition, 1996, IEEE Trans. Speech Audio Process.

[14] Philip C. Woodland. Speaker adaptation for continuous density HMMs: a review, 2001.