Join cost for unit selection speech synthesis

In unit-selection speech synthesis systems, synthetic speech is produced by concatenating speech units selected from a large database, or inventory, which contains many instances of each speech unit with varied prosodic and spectral characteristics. By selecting an appropriate sequence of units, it is therefore possible to synthesize highly natural-sounding speech. Selecting the unit sequence is typically treated as a search problem in which the best sequence of candidates from the inventory is the one with the lowest overall cost [1]. This cost is often decomposed into two components: a target cost (how closely candidate units in the inventory match the specification of the target phone sequence) and a join cost (how well neighboring units can be joined) [1]. If, as is usually the case, the cost functions take into account only properties of the fixed target sequence and local properties of the candidates, the optimal unit sequence can be found efficiently by a Viterbi search for the lowest-cost path through the lattice of candidates weighted by the target and join costs.

In this chapter we focus on the calculation of the join cost (also known as the concatenation cost). The ideal join cost is one that, although based solely on measurable properties of the candidate units (such as spectral parameters, amplitude, and F0), correlates highly with human listeners' perceptions of discontinuity at concatenation points. In other words, the join cost should predict the degree of perceived discontinuity. We use the following terminology: a join cost is computed by a join cost function, which generally applies a distance measure to some parameterization of the speech signal.
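The search described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the chapter: the unit representation (dictionaries with `f0`, `start`, and `end` fields), the toy target cost (absolute F0 difference), the join cost function (Euclidean distance between boundary feature vectors), and the names `target_cost`, `join_cost`, and `viterbi_select`.

```python
import math

def target_cost(spec, unit):
    # Toy target cost (assumption): absolute F0 mismatch with the target spec.
    return abs(spec["f0"] - unit["f0"])

def join_cost(left, right):
    # Toy join cost function (assumption): Euclidean distance between the
    # feature vector at the end of the left unit and at the start of the right.
    return math.dist(left["end"], right["start"])

def viterbi_select(targets, candidates, w_join=1.0):
    """Return (path, total): the index of the chosen candidate at each target
    position and the summed target and weighted join cost of that path."""
    # best[i][j]: lowest cost of any path ending at candidate j of position i.
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[i]:
            # Cost of arriving at `cand` from each candidate at position i - 1.
            arrive = [best[i - 1][k] + w_join * join_cost(prev, cand)
                      for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(arrive)), key=arrive.__getitem__)
            row.append(arrive[k_min] + target_cost(targets[i], cand))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path, total

# Two target positions, two hypothetical candidates each.
targets = [{"f0": 100}, {"f0": 110}]
candidates = [
    [{"f0": 100, "start": (0.0, 0.0), "end": (1.0, 0.0)},
     {"f0": 105, "start": (0.0, 0.0), "end": (0.0, 0.0)}],
    [{"f0": 110, "start": (5.0, 0.0), "end": (0.0, 0.0)},
     {"f0": 112, "start": (0.0, 0.0), "end": (0.0, 0.0)}],
]
path, total = viterbi_select(targets, candidates)
print(path, total)
```

In this toy lattice the second candidate at the second position has the worse target cost, but it joins more smoothly to the first unit, so the lowest-cost path trades a small target mismatch for a smaller join cost. Because each join cost depends only on adjacent candidates, the search is linear in the number of target positions and quadratic in the number of candidates per position.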

[1] Robert E. Donovan, et al. The IBM trainable speech synthesis system, 1998, ICSLP.

[2] Tony Greenfield, et al. Theory and Problems of Probability and Statistics, 1982.

[3] M. Jack, et al. Globally optimising formant tracker using generalised centroids, 1987.

[4] Michael W. Macon, et al. A perceptual evaluation of distance measures for concatenative speech synthesis, 1998, ICSLP.

[5] Alan A. Wrench. Analysis of fricatives using multiple centres of gravity, 1999.

[6] Alan W. Black, et al. Unit selection in a concatenative speech synthesis system using a large speech database, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[7] Raymond N. J. Veldhuis, et al. On the reduction of concatenation artefacts in diphone synthesis, 1998, ICSLP.

[8] Raymond N. J. Veldhuis, et al. On the computation of the Kullback-Leibler measure for spectral distances, 2003, IEEE Trans. Speech Audio Process.

[9] Biing-Hwang Juang, et al. Line spectrum pair (LSP) and speech data compression, 1984, ICASSP.

[10] Marc C. Beutnagel, et al. The AT&T Next-Gen TTS system, 1999.

[11] Yong Zhao, et al. Perceptually optimizing the cost function for unit selection in a TTS system with one single run of MOS evaluation, 2002, INTERSPEECH.

[12] Raymond N. J. Veldhuis, et al. Reducing audible spectral discontinuities, 2001, IEEE Trans. Speech Audio Process.

[13] Simon King, et al. New objective distance measures for spectral discontinuities in concatenative speech synthesis, 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis.

[14] Minkyu Lee. Perceptual Cost Functions for Unit Searching in Large Corpus-based Concatenative Text-to-Speech, 2001.

[15] Ann K. Syrdal. Phonetic effects on listener detection of vowel concatenation, 2001, INTERSPEECH.

[16] Hisashi Kawai, et al. Feature extraction for unit selection in concatenative speech synthesis: comparison between AIM, LPC, and MFCC, 2002, INTERSPEECH.

[17] Robert E. Donovan, et al. A new distance measure for costing spectral discontinuities in concatenative speech synthesizers, 2001, SSW.

[18] Alex Acero, et al. Spoken Language Processing: A Guide to Theory, Algorithm and System Development, 2001.

[19] Yannis Stylianou, et al. Perceptual and objective detection of discontinuities in concatenative speech synthesis, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings.

[20] Hu Peng, et al. An objective measure for estimating MOS of synthesized speech, 2001, INTERSPEECH.

[21] Hisashi Kawai, et al. Acoustic measures vs. phonetic features as predictors of audible discontinuity in concatenative speech synthesis, 2002, INTERSPEECH.

[22] Nick Campbell, et al. Objective distance measures for assessing concatenative speech synthesis, 1999, EUROSPEECH.

[23] Jithendra Vepa. Objective distance measures for spectral discontinuities in concatenative speech synthesis, 2002.

[24] Raymond N. J. Veldhuis, et al. A solution to the reduction of concatenation artefacts in speech synthesis, 2000, INTERSPEECH.

[25] Paul C. Bagshaw, et al. Concatenation cost calculation and optimisation for unit selection in TTS, 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis.

[26] Huaiyu Zhu. On Information and Sufficiency, 1997.

[27] George Carayannis, et al. Reducing spectral mismatches in concatenative speech synthesis via systematic database enrichment, 2001, INTERSPEECH.

[28] R. Patterson, et al. Time-domain modeling of peripheral auditory processing: a modular architecture and a software platform, 1995, The Journal of the Acoustical Society of America.