Measuring attribute dissimilarity with HMM KL-divergence for speech synthesis

This paper proposes to use KLD between context-dependent HMMs as target cost in unit selection TTS systems. We train context-dependent HMMs to characterize the contextual attributes of units, and calculate Kullback-Leibler Divergence (KLD) between the corresponding models. We demonstrate that the KLD measure provides a statistically meaningful way to analyze the underlining relations among elements of attributes. With the aid of multidimensional scaling, a set of attributes, including phonetic, prosodic and numerical contexts, are examined by graphically representing elements of the attribute as points on a low dimensional space, where the distances among points agree with the KLDs among the elements. The KLD between multi-space probability distribution HMMs is derived. A perceptual experiment shows that the TTT system defined with the KLD-based target cost sounds slightly better than one with the manuallytuned.

[1]  Yannis Stylianou,et al.  Perceptual and objective detection of discontinuities in concatenative speech synthesis , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[2]  Yong Zhao,et al.  Microsoft Mulan - a bilingual TTS system , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[3]  Paul Taylor,et al.  The target cost formulation in unit selection speech synthesis , 2006, INTERSPEECH.

[4]  Frank K. Soong,et al.  Divergence-Based Similarity Measure for Spoken Document Retrieval , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[5]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[6]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[7]  Yuan Jiang,et al.  Multi-tier Non-uniform Unit Selection for Corpus-based Speech Synthesis , 2006 .

[8]  Nick Campbell,et al.  Optimising selection of units from speech databases for concatenative synthesis , 1995, EUROSPEECH.

[9]  Yong Zhao,et al.  Perpetually optimizing the cost function for unit selection in a TTS system with one single run of MOS evaluation , 2002, INTERSPEECH.

[10]  Shiri Gordon,et al.  An efficient image similarity measure based on approximations of KL-divergence between two gaussian mixtures , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[11]  M. Do Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models , 2003, IEEE Signal Processing Letters.

[13]  Keiichi Tokuda,et al.  Multi-Space Probability Distribution HMM , 2002 .

[14]  Tomoki Toda,et al.  Optimizing sub-cost functions for segment selection based on perceptual evaluations in concatenative speech synthesis , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[15]  Abeer Alwan,et al.  Text to Speech Synthesis: New Paradigms and Advances , 2004 .