Continuous Stochastic Feature Mapping Based on Trajectory HMMs

This paper proposes a continuous stochastic feature-mapping technique based on trajectory hidden Markov models (HMMs), which are derived from HMMs by imposing explicit relationships between static and dynamic features. Although Gaussian mixture model (GMM)- and HMM-based feature-mapping techniques work effectively, their accuracy occasionally degrades because frame-by-frame mapping can produce inappropriate dynamic characteristics. Using dynamic-feature constraints at the mapping stage alleviates this problem, but it introduces an inconsistency between training and mapping. The proposed technique eliminates this inconsistency while retaining the benefits of dynamic-feature constraints, and it transforms entire sequences rather than mapping frame by frame. Results from speaker-conversion, acoustic-to-articulatory inversion-mapping, and noise-compensation experiments demonstrate that the proposed approach outperforms the conventional one.
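To make the role of the dynamic-feature constraint concrete, the following is a minimal illustrative sketch (not the paper's implementation) of maximum-likelihood trajectory estimation under a delta constraint, the idea underlying trajectory HMMs and MLPG-style parameter generation. Given per-frame Gaussian means and variances for the static and delta features, the static trajectory c is obtained by solving the linear system (WᵀD⁻¹W)c = WᵀD⁻¹μ, where W stacks the identity and delta windows and D is the diagonal covariance. The delta window and boundary handling below are simplifying assumptions chosen for readability.

```python
import numpy as np

def trajectory_ml(mu_static, var_static, mu_delta, var_delta):
    """ML estimate of a 1-D static trajectory under a delta constraint.

    Solves (W^T D^-1 W) c = W^T D^-1 mu, where W maps the static
    trajectory c (length T) to stacked [static; delta] features (2T).
    """
    T = len(mu_static)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[t, t] = 1.0  # static row: reproduces c_t
        # delta row: 0.5 * (c_{t+1} - c_{t-1}), with one-sided
        # differences at the sequence boundaries (an assumption here;
        # window definitions vary in practice)
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[T + t, hi] += 0.5
        W[T + t, lo] += -0.5
    mu = np.concatenate([mu_static, mu_delta])
    prec = 1.0 / np.concatenate([var_static, var_delta])  # diagonal precision
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

# A step in the per-frame static means is smoothed into a gradual
# trajectory, because the delta means (zero) penalize abrupt changes.
mu_s = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
c = trajectory_ml(mu_s, np.full(6, 1.0), np.zeros(6), np.full(6, 0.1))
```

Frame-by-frame mapping would output the step in `mu_s` directly; the trajectory solution instead trades off static accuracy against smooth dynamics over the whole sequence, which is the sequence-level behavior the abstract contrasts with frame-by-frame mapping.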
