Non-uniform unit selection and the similarity metric within BT's Laureate TTS system

In BT's Laureate text-to-speech system, generating natural-sounding synthetic speech from text can be viewed as a three-stage process. The first stage converts general text into a normalised textual representation; it may consist of a number of components designed to handle domain-specific problems. The second stage converts the normalised, linear orthographic input into a structured linguistic description. Its components perform orthography-to-phoneme conversion, syntactic analysis, performance parsing, and the prediction of duration and intonation. The third and final stage uses this linguistic structure to generate synthetic speech. The nature of this production stage depends on the production method; over the last five years, work on Laureate has concentrated on a concatenative approach. The method of unit selection used within this stage contributes significantly to the eventual quality of the synthetic speech, and this paper describes the method currently implemented within Laureate.

Concatenative speech synthesis systems generate speech from a unit inventory of sounds. The phoneme has proved the most popular symbolic representation of sound in these systems, but storing a single sample phone for each phoneme is not sufficient for good-quality synthesis. Coarticulation is one reason: the production of a phone can be strongly influenced by its preceding and following neighbours. The challenge for any method of unit selection is to select, efficiently, the units in the inventory that provide the best approximation to the desired phones, where "best" is defined in some clearly specified way.

The Laureate system uses mixed N-phone units. In theory such units could be of arbitrary size, but in practice they are constrained to a maximum of three phones (a triphone).
In addition, unlike traditional methods of unit selection, Laureate does not attempt to find the best unit from a fixed, pre-selected set. Rather, it dynamically generates a sequence of units based on a global cost. Units are selected using purely phonologically motivated criteria, without reference to any acoustic features, either desired or available within the inventory. Details of the selection process are provided in the paper, together with a discussion of existing shortfalls of the method and envisaged future improvements.
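
As an illustration of what selecting a sequence of mixed N-phone units under a global cost might look like, the following is a minimal sketch and not Laureate's actual algorithm. It assumes a hypothetical inventory given as a set of phone tuples of length one to three, and a placeholder cost of 1 per selected unit, so that the cheapest sequence is the one with the fewest units. The name `select_units`, the inventory format, and the cost are all assumptions made for this example; the abstract does not specify Laureate's phonological cost.

```python
from typing import List, Sequence, Set, Tuple

MAX_UNIT_LEN = 3  # units are constrained to at most three phones (a triphone)

Unit = Tuple[str, ...]


def select_units(phones: Sequence[str], inventory: Set[Unit]) -> Tuple[List[Unit], float]:
    """Cover `phones` with inventory units so that the total cost is minimal.

    Assumption: every selected unit costs 1.0 (a placeholder, not the
    phonologically motivated cost used in Laureate).
    """
    n = len(phones)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimal cost of covering phones[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last unit in that cover
    best[0] = 0.0

    for end in range(1, n + 1):
        for length in range(1, min(MAX_UNIT_LEN, end) + 1):
            start = end - length
            unit = tuple(phones[start:end])
            if unit not in inventory or best[start] == inf:
                continue
            cost = best[start] + 1.0  # placeholder per-unit cost
            if cost < best[end]:
                best[end] = cost
                back[end] = start

    if best[n] == inf:
        raise ValueError("the inventory cannot cover the requested phone sequence")

    # Walk the back-pointers to recover the chosen unit sequence.
    units: List[Unit] = []
    end = n
    while end > 0:
        start = back[end]
        units.append(tuple(phones[start:end]))
        end = start
    units.reverse()
    return units, best[n]


if __name__ == "__main__":
    # Hypothetical toy inventory: four single phones, one diphone, one triphone.
    toy_inventory = {("k",), ("a",), ("t",), ("s",), ("t", "s"), ("k", "a", "t")}
    print(select_units(["k", "a", "t", "s"], toy_inventory))
```

Because the cost is accumulated over the whole sequence, the choice of each unit depends on its neighbours rather than being made in isolation from a fixed candidate set. A more realistic cost would replace the constant 1.0 with a measure of phonological mismatch between the desired context and the context in which the inventory unit was recorded.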