Quantized HMMs for low footprint text-to-speech synthesis

This paper proposes the use of Quantized Hidden Markov Models (QHMMs) to reduce the footprint of a conventional HMM-based parametric TTS system. Previously, this technique was successfully applied to automatic speech recognition on embedded devices with no loss of recognition performance. In this paper we investigate the construction of different quantized-HMM configurations that serve as input to the standard maximum-likelihood (ML) parameter generation algorithm, and we compare the resulting systems with both subjective and objective tests. Subjective results for specific compression configurations show no significant preference over the baseline, although some spectral distortion is reported. We conclude that a trade-off is necessary to satisfy both speech-quality and low-footprint memory requirements.
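The abstract only summarizes the approach, but the core idea of codebook-based parameter quantization can be sketched as follows. In this hypothetical Python sketch (not the authors' implementation), Gaussian mean or variance vectors pooled from all HMM states are replaced by indices into a shared codebook trained with plain k-means iterations, a simplified stand-in for classical LBG vector-quantizer design; the names `lbg_codebook`, `quantize`, and `footprint_ratio` are illustrative assumptions.

```python
import numpy as np

def lbg_codebook(vectors, codebook_size, iters=20, seed=0, init=None):
    """Train a codebook with plain k-means iterations (a simplification
    of classical LBG vector-quantizer design). vectors: (N, D) array."""
    rng = np.random.default_rng(seed)
    if init is None:
        # seed the codebook with randomly chosen training vectors
        idx = rng.choice(len(vectors), size=codebook_size, replace=False)
        codebook = vectors[idx].astype(float).copy()
    else:
        codebook = np.asarray(init, dtype=float).copy()
    for _ in range(iters):
        # nearest-codeword assignment under Euclidean distance
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # move each codeword to the centroid of the vectors assigned to it
        for k in range(codebook_size):
            members = vectors[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vectors, codebook):
    """Replace each parameter vector by the index of its nearest codeword."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

def footprint_ratio(n_vectors, dim, codebook_size,
                    bytes_per_float=4, bytes_per_index=1):
    """Original model size divided by quantized size (codebook + indices)."""
    original = n_vectors * dim * bytes_per_float
    quantized = codebook_size * dim * bytes_per_float + n_vectors * bytes_per_index
    return original / quantized
```

For example, 10,000 mean vectors of dimension 40 stored as 32-bit floats shrink by roughly a factor of 30 when replaced by a 256-entry codebook plus 8-bit indices; the spectral distortion reported in the paper corresponds to the reconstruction error such a codebook introduces.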
