Speech synthesis without a phone inventory

In speech synthesis, the unit inventory is normally decided using phonological and phonetic expertise, a process that is resource intensive and potentially sub-optimal. In this paper we investigate how acoustic clustering, together with lexicon constraints, can be used to build a self-organised inventory. Six English speech synthesis systems were built using two frameworks, unit selection and parametric HTS, for three inventory conditions: 1) a traditional phone set, 2) orthographic (letter-based) units, and 3) a self-organised inventory. A listening test showed a strong preference for the traditional phone-based system, and for the orthographic system over the self-organised system. Results also varied with letter-to-sound complexity and database coverage. This suggests the self-organised approach failed to generalise pronunciation and, in addition, introduced noise beyond that caused by the mismatch between orthography and sound.

Index Terms: speech synthesis, unit selection, parametric synthesis, phone inventory, orthographic synthesis
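
The abstract does not spell out the clustering recipe, so the following is only a minimal sketch of the general idea: acoustic segments are clustered into candidate units, and a lexicon-style constraint ties segments aligned to the same letter toward a shared unit. The data, feature dimensions, and the majority-vote constraint used here are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch (assumptions, not the paper's algorithm): each candidate
# unit is an acoustic segment summarised by a fixed-length feature vector
# (e.g. mean cepstra), and the "lexicon constraint" is approximated by forcing
# all segments aligned to the same letter onto that letter's dominant cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 200 segments, 13-dim acoustic summaries, each aligned to
# a letter from the orthography (the side information supplying constraints).
features = rng.normal(size=(200, 13))
letters = rng.choice(list("abcdefgh"), size=200)

# Step 1: unconstrained acoustic clustering into a candidate unit inventory.
n_units = 16
acoustic_labels = KMeans(n_clusters=n_units, n_init=10,
                         random_state=0).fit_predict(features)

# Step 2: crude stand-in for the lexicon constraint -- map each letter to the
# cluster its segments most often fall into, so a letter corresponds to a
# consistent self-organised unit across the database.
unit_for_letter = {}
for letter in np.unique(letters):
    counts = np.bincount(acoustic_labels[letters == letter], minlength=n_units)
    unit_for_letter[letter] = int(np.argmax(counts))

constrained_labels = np.array([unit_for_letter[l] for l in letters])
print("self-organised units actually used:", len(set(constrained_labels)))
```

A real system would operate on aligned speech segments rather than random vectors and would apply the constraints inside the clustering objective (as in constrained k-means) rather than as a post-hoc vote, but the sketch shows how acoustic similarity and orthographic alignment can jointly define a unit inventory without a hand-built phone set.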
