Variable-Length Unit Selection in TTS Using Structural Syntactic Cost

This paper presents a variable-length unit selection scheme that uses a syntactic cost to select text-to-speech (TTS) synthesis units. The syntactic structure of a sentence is derived from a probabilistic context-free grammar (PCFG) and represented as a syntactic vector. The syntactic difference between target and candidate units (words or phrases) is estimated by the cosine measure, with the inside probability of the PCFG acting as a weight. Latent semantic analysis (LSA) is applied to reduce the dimensionality of the syntactic vectors. A dynamic programming algorithm is adopted to obtain the concatenated unit sequence with minimum cost. A speech database rich in syntactic properties is designed and collected as the unit inventory. Several experiments with statistical testing are conducted to assess the quality of the synthetic speech as perceived by human subjects. The proposed method outperforms a synthesizer that does not consider syntactic properties, and the structural syntax estimates the substitution cost better than acoustic features alone.
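
As a rough illustration of the two computational steps named in the abstract, the following Python sketch pairs a weighted cosine cost with a dynamic programming search for the minimum-cost unit sequence. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `weighted_cosine_cost` and `select_units`, the per-dimension form of the weights (standing in for PCFG inside probabilities), the `join_cost` callback, and the fixed one-unit-per-position lattice are all hypothetical simplifications. The LSA reduction step and the handling of variable-length units are omitted.

```python
import math


def weighted_cosine_cost(target, candidate, weights):
    """Return 1 - weighted cosine similarity of two equal-length vectors.

    Assumption: `weights` holds one non-negative weight per dimension,
    standing in for the PCFG inside probabilities mentioned in the abstract.
    """
    num = sum(w * t * c for w, t, c in zip(weights, target, candidate))
    den_t = math.sqrt(sum(w * t * t for w, t in zip(weights, target)))
    den_c = math.sqrt(sum(w * c * c for w, c in zip(weights, candidate)))
    if den_t == 0.0 or den_c == 0.0:
        return 1.0  # no syntactic evidence: treat as maximally different
    return 1.0 - num / (den_t * den_c)


def select_units(target_costs, join_cost):
    """Viterbi-style search for the cheapest candidate sequence.

    target_costs[i][j] is the substitution cost of candidate j at position i.
    join_cost(i, p, j) is the (hypothetical) cost of concatenating
    candidate p at position i - 1 with candidate j at position i.
    Returns (total_cost, list of chosen candidate indices).
    """
    best = list(target_costs[0])  # cheapest cost of a path ending at each candidate
    back = []                     # back-pointers, one list per position after the first
    for i in range(1, len(target_costs)):
        new_best, pointers = [], []
        for j, tc in enumerate(target_costs[i]):
            # Pick the predecessor that minimises accumulated cost plus join cost.
            k = min(range(len(best)), key=lambda p: best[p] + join_cost(i, p, j))
            new_best.append(best[k] + join_cost(i, k, j) + tc)
            pointers.append(k)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path
```

In use, `target_costs` would be filled by calling `weighted_cosine_cost` on the syntactic vector of each target unit and each candidate from the inventory, and `select_units` would then return the candidate indices of the cheapest sequence.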
