In BT's Laureate text-to-speech system, the generation of natural-sounding synthetic speech from text can be viewed as a three-stage process. The first stage converts general text into a normalised textual representation and may consist of a number of components designed to handle domain-specific problems. The second stage converts the normalised linear orthographic input into a structured linguistic description; its components include orthography-to-phoneme conversion, syntactic analysis, performance parsing, and the prediction of duration and intonation. The third and final stage uses this linguistic structure to generate synthetic speech. The nature of this production stage depends on the production method; over the last five years the Laureate system has concentrated on a concatenative approach. Clearly, the method of unit selection used within this stage contributes significantly to the eventual quality of the synthetic speech produced. This paper describes the method of unit selection currently implemented within Laureate.

Concatenative speech synthesis systems generate speech from a unit inventory of sounds. The phoneme has proved the most popular symbolic representation of sound in these systems, but simply storing one sample phone per phoneme is not sufficient for good-quality synthesis. Coarticulation is one reason why: the production of a phone can be strongly influenced by its preceding and following neighbours. The challenge for all methods of unit selection is to provide an efficient way of selecting units which, in some clearly specified way, give the best approximation to the desired phones available within the inventory.

The Laureate system uses mixed N-phone units. In theory such units could be of arbitrary size, but in practice they are constrained to a maximum of three phones (a triphone). In addition, unlike traditional methods of unit selection, Laureate does not attempt to find the best unit from a fixed pre-selected set; rather, it dynamically generates a sequence of units based on a global cost. Units are selected using purely phonologically motivated criteria, without reference to any acoustic features, either desired or available within the inventory. The paper provides details of the selection process, together with a discussion of existing shortfalls of the method and envisaged future improvements.
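To make the idea of dynamically generating a unit sequence from a global cost more concrete, the following is a minimal sketch in Python, assuming a purely phonological local cost (context-mismatch penalties plus a preference for longer units) minimised by dynamic programming over segmentations of the target phoneme string. The type and function names (Unit, local_cost, select_units) and the specific weights are illustrative assumptions, not the Laureate implementation described in the paper.

```python
# A minimal sketch of global-cost selection over mixed n-phone units
# (single phones, diphones, triphones). All names, costs, and weights here
# are illustrative assumptions, not the actual Laureate implementation.

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Unit:
    phones: Tuple[str, ...]   # one to three phones (triphone maximum)
    left_context: str         # phoneme preceding the unit in the inventory
    right_context: str        # phoneme following the unit in the inventory


def local_cost(unit: Unit, prev_phone: str, next_phone: str) -> float:
    """Purely phonological cost: penalise context mismatches and prefer
    longer units (fewer joins). No acoustic features are consulted."""
    cost = 0.0
    if unit.left_context != prev_phone:
        cost += 1.0                       # coarticulation mismatch on the left
    if unit.right_context != next_phone:
        cost += 1.0                       # coarticulation mismatch on the right
    cost += (3 - len(unit.phones)) * 0.5  # bias towards triphones
    return cost


def select_units(target: List[str],
                 inventory: Dict[Tuple[str, ...], List[Unit]]) -> List[Unit]:
    """Dynamic programming over segmentations of the target phoneme string,
    minimising the summed (global) cost of the covering unit sequence."""
    n = len(target)
    best: List[float] = [float("inf")] * (n + 1)
    back: List[Optional[Tuple[int, Unit]]] = [None] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("inf"):
            continue
        for size in (3, 2, 1):            # try triphone, then diphone, then phone
            j = i + size
            if j > n:
                continue
            key = tuple(target[i:j])
            prev_p = target[i - 1] if i > 0 else "#"  # "#" marks an utterance boundary
            next_p = target[j] if j < n else "#"
            for unit in inventory.get(key, []):
                cost = best[i] + local_cost(unit, prev_p, next_p)
                if cost < best[j]:
                    best[j] = cost
                    back[j] = (i, unit)
    if n > 0 and back[n] is None:
        raise ValueError("target cannot be covered by the inventory")
    # Walk the backpointers to recover the best-scoring unit sequence.
    units: List[Unit] = []
    j = n
    while j > 0:
        i, unit = back[j]
        units.append(unit)
        j = i
    return list(reversed(units))
```

In this sketch the global cost is simply the sum of per-unit local costs over a segmentation; the weights and boundary handling are placeholders and would differ in a real system.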