The quality improvement of a Text-To-Speech synthesis system is usually considered the arduous task of converting any text into speech. This paper relates to work led at CNET on building application-oriented text-to-speech systems. For a majority of vocal services, the delivered messages have a strong syntactic constraint and use a limited vocabulary. We consider that, with our system, the most promising improvements in the overall quality of the synthetic speech signal are linked to the linguistic and prosodic processing. Setting aside segmental problems of the synthetic speech signal, the current prosodic patterns are judged too monotonous to allow a great diversity of vocal services. Thus, the current effort deals with the development of automatic systems to adapt the parameters of statistical prosodic models to a specific speaker's voice, under the constraint of a limited number of different syntactic structures. This work presents an automatic system to build "optimal" training databases used to learn the models' parameters. The problem is formulated as a set covering problem and solved using genetic algorithms. Both an objective and a subjective evaluation show the usefulness of this approach.

This paper presents an experiment conducted at CNET on an intra-firm phone directory service using both speech recognition and text-to-speech synthesis. The overall quality of the CNETVOX Text-To-Speech synthesis system for French is generally judged acceptable for most text-to-speech applications. But for an application where every message has the same syntactic construction, the prosody is too monotonous. This is the reason why an automatic prosodic adaptation system has been developed.
The objective of such a system is to learn, as well as possible, the prosody of one specific speaker uttering messages of the specific application. This application-dependent speech synthesis system predicts the prosody with statistical models and needs learning databases to estimate the models' parameters. The goal of this paper is to demonstrate the crucial role of building an optimised learning database and its influence on the speech output quality. A method to build such an optimised database is proposed; the database is specific to an application and is recorded by a speaker. The speech synthesis system tends to mimic the natural prosody exhibited by the speaker. In part one, the phone directory inquiry service is presented. In part two, the method for building an optimised learning database is given. Finally, in part three, both an objective and a subjective evaluation are presented.

INTRODUCTION

The availability of effective telecommunications media nowadays allows a great development of services based on vocal technologies. Generally, these services are specialised: for example, one can access a meteorological forecasting service, a phone-order shopping service or an intra-firm phone directory. The restricted field of the vocal service allows an ergonomic interface accepted by the end-user to be defined. Thus, the messages delivered by such systems have a small syntactic variability. Usually, a message is a sentence composed of two parts: an invariable part that serves as a template, and variable fields containing the information for the end-user.

Up to now, most vocal services use pre-recorded, compressed and stored natural speech. Obviously, speech output based on this technology has a very good quality. Nevertheless, this technology is unfeasible if the service needs fast updates of its databases: such updates occur, for example, in an intra-firm phone directory application.
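As a minimal sketch of the message structure just described (all names here are hypothetical, not taken from the CNET system), a service message can be seen as a fixed carrier sentence whose variable fields carry the end-user information:

```python
# Hypothetical sketch: a vocal-service message is an invariable carrier
# sentence with variable fields carrying the information for the end-user.
TEMPLATE = "{first_name} {family_name}, poste {extension}"

def build_message(first_name, family_name, extension):
    """Fill the variable fields of the fixed carrier sentence."""
    return TEMPLATE.format(first_name=first_name,
                           family_name=family_name,
                           extension=extension)

print(build_message("Jean", "Dupont", "12 00"))  # Jean Dupont, poste 12 00
```

Only the field contents change from one message to the next, which is why the invariable part can use natural prosody while the fields need a prosodic model.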
The solution for vocal services requiring continuous updates is to use a Text-To-Speech synthesis system, which generates the speech messages sent to the users.

1 PHONE DIRECTORY INQUIRY SERVICE

Using an intra-firm phone directory inquiry service, a user can automatically obtain, through phone speech recognition, the information on a phone correspondent and ask the system whether or not to perform the call. The system gives the user the full name and the phone number of the correspondent. Two kinds of synthetic speech messages are delivered: unchanging messages related to the ergonomic constraints of the application, and varying messages giving the information to the user (the varying parts, or fields, are the first name, the family name and the phone number of the correspondent).

At CNET, the acoustic level is realised by concatenating diphones and a set of longer units, and by processing the speech units with the TD-PSOLA technique [1]: this defines the acoustic string of the message. For the unchanging part of the message, a natural prosody is put on the acoustic string; for the varying parts of the message, a model is used to generate the prosody on the acoustic string.

1.1 Prosodic models

For each varying field, the modelled prosodic parameters are the segmental duration (one value for each phoneme) and the F0 pattern (two values for each phoneme). As in many statistical prosodic models [2][3], the input variables can be of different kinds: language- or application-dependent syntactic variables, syllabic variables and phonemic variables. The segmental duration and the F0 patterns are observed through a time dimension.
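A minimal sketch (with made-up class names, not the paper's exact feature set) of how such heterogeneous per-phoneme input variables can be encoded, following the scheme described later in the text, where modal variables are 1-in-n binary encoded and real variables are normalised on [0,1]:

```python
# Illustrative encoding sketch: one modal variable (phoneme class, 7
# modalities) plus two real variables, as used by the prosodic models.
PHONEME_CLASSES = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]  # placeholder names

def one_in_n(value, modalities):
    """1-in-n binary encoding of a modal variable."""
    return [1.0 if m == value else 0.0 for m in modalities]

def encode_input(phoneme_class, pos_in_syllable, n_phonemes, max_len=10):
    """Build an input vector fragment; max_len is an assumed normalisation bound."""
    vec = one_in_n(phoneme_class, PHONEME_CLASSES)
    vec.append(pos_in_syllable / max_len)  # position of the phoneme in the syllable
    vec.append(n_phonemes / max_len)       # number of phonemes in the syllable
    return vec

print(len(encode_input("c3", 2, 4)))  # 9
```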
The models take this time dimension into account using contextual windows for each variable; the length of a contextual window depends on the nature of the variable. The models used for both duration and F0 prediction are neural networks.

1.1.1 Segmental duration model

Three boolean variables indicate whether the current phoneme is located at the end of a field, whether it is followed by a final pause, and whether it is followed by a non-final pause. Two real variables give the position of the current syllable inside the current word and the number of syllables in the current word. Two real variables give the position of the current phoneme inside the syllable and the number of phonemes in the current syllable. Finally, three modal variables give the class of the previous, current and next phoneme (a class contains 7 modalities).

The output (the duration of the current phoneme) is a real variable taking values on a linear scale and normalised on [0,1]. The neural network used for duration is a three-layered network: the input layer contains 32 cells, the hidden layer 15 cells, and the output layer one cell. The activation functions of the cells are sigmoid functions. A modal variable is encoded in a binary way with a 1-in-n technique; a real variable is presented as a real continuous value to the network.

1.1.2 F0 pattern model

A modal variable indicates the field that the current phoneme belongs to. Two boolean variables indicate whether the current phoneme is followed by a final pause and whether it is followed by a non-final pause. Two real variables give the position of the syllable in the field and the number of syllables in the field. Two other real variables give the position of the current phoneme in the syllable and the number of phonemes in the syllable. Finally, a modal variable indicates the phonemic class of the current phoneme (one modality out of 7).

Two output values give the F0 value at the beginning and at the end of the phoneme. These outputs are real variables taking values on a linear scale and normalised on [0,1]. The neural network used for F0 is a three-layered network: the input layer contains 18 cells, the hidden layer 10 cells, and the output layer 2 cells. The activation functions of the cells are sigmoid functions. As for the duration model, a modal variable is encoded in a binary way with a 1-in-n technique and a real variable is presented as a real continuous value to the network.

1.2 Learning process of model parameters

The set of learning samples, defined in a training database, contains pairs of input/output variables observed in a natural prosody corpus. This set of learning samples directly determines the quality of the models: their generality over new, unseen inputs and their robustness. A learning database should therefore contain the maximum variability of the phenomenon, observed through the input variables of the models.

Once the learning database is defined, the learning sentences are recorded by a speaker. An automatic labelling system of speech into phonemes (including segmentation of the speech into phones and alignment of an automatic phoneme transcription of the text sentence with the speech) and an automatic pitch-tracking process compute the prosodic information for each phoneme of the sentence: a segmental duration and two F0 values are assigned to each phoneme. This process can also detect an acoustic pause and its duration at a word boundary. The automatic phoneme sequence contains multiple phone sequences depending on phonological information or "regional" variants of pronunciation; the optimal sequence is chosen by the alignment process based on a probabilistic criterion [4].

The Aspirin/MIGRAINES software [5] was used to optimise the neural networks' parameters. A validation database (natural messages used in neither the learning nor the test databases) is used to find a heuristic threshold for stopping the learning process: the threshold is the maximum number of training iterations over the whole training database, set at the point where the mean square error of a model output starts to increase on the validation database.

2 OPTIMAL LEARNING DATABASE DESIGN

This section presents the solution developed to automatically design the learning database. The system finds a minimal set of sentences which covers the variability of the phenomena (duration and F0) modelled in the text-to-speech system.

2.1 Problem description

With the phone directory inquiry application, only one type of sentence embedding variable fields is defined: "Jean Dupont, poste 12 00". Three different fields are defined: the first name, the family name and the phone number.
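Selecting the learning sentences is thus a set covering problem: choose a minimal set of candidate sentences whose observed feature combinations cover all combinations relevant to the application. The paper solves it with genetic algorithms; the greedy baseline below (with made-up data) only illustrates the covering objective, not the paper's actual method:

```python
# Greedy set-cover sketch: repeatedly pick the sentence that covers the most
# still-uncovered feature combinations. Data below is purely illustrative.
def greedy_cover(universe, candidates):
    """candidates: dict mapping sentence -> set of covered feature combinations."""
    uncovered, chosen = set(universe), []
    while uncovered:
        # sentence covering the most still-uncovered combinations
        best = max(candidates, key=lambda s: len(candidates[s] & uncovered))
        if not candidates[best] & uncovered:
            break  # remaining combinations cannot be covered by any candidate
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

combos = {1, 2, 3, 4, 5}
sentences = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5}, "s4": {1, 5}}
print(greedy_cover(combos, sentences))  # ['s1', 's3']
```

The greedy heuristic gives no optimality guarantee on such instances, which motivates the stochastic search (genetic algorithms) used in this work.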
REFERENCES

[1] O. Boëffard et al., "Multilingual PSOLA text-to-speech system", IEEE International Conference on Acoustics, Speech, and Signal Processing, 1993.
[2] T. Grossman et al., "Computational Experience with Approximation Algorithms for the Set Covering Problem", 1994.
[3] J. N. Gowdy et al., "Neural network based generation of fundamental frequency contours", International Conference on Acoustics, Speech, and Signal Processing, 1989.
[4] D. W. Corne et al., "Evolutionary Divide and Conquer for the Set-Covering Problem", Evolutionary Computing, AISB Workshop, 1996.