This work proposes a concatenative text-to-speech synthesizer for the Polish language, based on diphones and the "Harmonics and Noise Model" (HNM). HNM has been successfully applied in a speech encoder and decoder, yielding high-quality processed speech at a low bit rate. Applying this model to a speech synthesis system yields good-quality synthesized speech and a small parameter database. The proposed system consists of two main modules. The first, the Natural Language Processing (NLP) module, analyses the written text and converts it into phonemes and diphones using morphological rules. At the same time, it extracts prosodic features that are later used to modify the parameters of the synthesized speech in order to produce stress and intonation. The second module is the synthesis system, derived from the speech decoder and preceded by a stage that adapts the speech parameters according to prosodic rules. Synthesis from the parameters works in the frequency domain and uses the spectral envelope, which makes it easy to modify the frequency, amplitude and duration of the signal when applying the prosodic rules. An algorithm that determines a continuous phase at speech frame borders allows portions of synthesized speech and diphones to be concatenated without phase distortion at the joins. The synthesizer operates on a diphone database created by segmenting recorded speech into fragments that represent pairs of phonemes. The sounds corresponding to the diphones are analyzed by the speech encoder, which provides parameters describing the harmonic and noise components of speech using the line spectral frequency (LSF) coefficients of the linear prediction filter, resulting in a small diphone database.
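The idea of keeping the phase continuous at frame borders can be illustrated with a minimal sketch. It is not the algorithm of the proposed system: it covers only the harmonic part, holds the pitch and harmonic amplitudes constant within a frame, and ignores the noise component and the spectral envelope. The function name `synthesize_harmonics`, the frame format, and the default sample rate and frame length are assumptions made for this illustration.

```python
import math


def synthesize_harmonics(frames, sample_rate=8000, frame_len=80):
    """Sum-of-sinusoids synthesis with phase carried across frame borders.

    frames: list of (f0_hz, amplitudes), where amplitudes[k] is the amplitude
    of harmonic k + 1. This frame format is a hypothetical one chosen for the
    sketch, not the parameter set of the system described above.
    """
    max_harmonics = max(len(amps) for _, amps in frames)
    # One running phase per harmonic. It is never reset at a frame border,
    # so the waveform stays continuous where frames (or diphones) are joined.
    phases = [0.0] * max_harmonics
    out = []
    for f0, amps in frames:
        for _ in range(frame_len):
            out.append(sum(a * math.sin(phases[k]) for k, a in enumerate(amps)))
            # Advance every harmonic by its per-sample phase increment,
            # using the pitch of the current frame.
            for k in range(max_harmonics):
                step = 2.0 * math.pi * (k + 1) * f0 / sample_rate
                phases[k] = (phases[k] + step) % (2.0 * math.pi)
    return out


# Two frames with different pitch: the join shows no jump in the waveform.
signal = synthesize_harmonics([(100.0, [1.0, 0.5]), (120.0, [1.0, 0.5])])
```

In this sketch, pitch is changed by editing `f0_hz` and duration by repeating or dropping frames. A complete harmonic-plus-noise synthesizer would also resample the harmonic amplitudes from the spectral envelope when the pitch changes.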