Introductory Evaluation of the Swedish RealSpeak System

Text-to-speech systems have evolved massively over the last decade, and new techniques have been introduced that have taken the technology further. Lernout & Hauspie's RealSpeak is a diphone-concatenation system with automatic unit selection. What distinguishes it most from other systems is that it uses raw speech as output. Nor is it limited to a single instance of each diphone: it draws on quite a large speech corpus, which contains hundreds of instances. Given a phoneme stream and a target prosody for an utterance, it selects an optimum set of acoustic units that best match the target specification.

Working with the Swedish system leads to many subjective conclusions, but an investigation and analysis of the system's bugs can help us understand some of the problems of natural-sounding synthesis and may give some clues on how to deal with them. This article will deal with:

- How does the Swedish RealSpeak system perform?
- What parts does it not handle well, and why?
- Is this technology something to build on for future text-to-speech systems, and if so, how?

An evaluation is needed since the technology used is far from perfect, though it may be the best so far. This also means that building new systems on this technological platform might give us even more natural-sounding synthesis, which is needed for many applications. Tests will be based on subjective ratings of the demo at www.lhsl.com/realspeak and on spectrogram analysis using WaveSurfer and Praat. Spectrograms can be found in the appendix.
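
To make the unit-selection step mentioned above more concrete (choosing, for a phoneme stream and a target prosody, the acoustic units that best match the target), the following is a minimal sketch of a generic unit-selection search. It is not RealSpeak's actual algorithm, which is not documented here. The `Unit` and `Target` types, the pitch and duration features, the cost weights and the rule that adjacent corpus units join for free are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded diphone instance in a hypothetical corpus."""
    diphone: str     # e.g. "h-e"
    pitch: float     # mean F0 in Hz (assumed feature)
    duration: float  # in ms (assumed feature)
    unit_id: int     # position in the recorded corpus


@dataclass(frozen=True)
class Target:
    """Desired diphone with its target prosody."""
    diphone: str
    pitch: float
    duration: float


def target_cost(t: Target, u: Unit) -> float:
    # Assumed cost: distance between target prosody and the recorded unit.
    return abs(t.pitch - u.pitch) / 10.0 + abs(t.duration - u.duration) / 10.0


def join_cost(a: Unit, b: Unit) -> float:
    # Assumed cost: units that were adjacent in the recording join for free;
    # any other join pays a fixed penalty plus a pitch-mismatch term.
    if b.unit_id == a.unit_id + 1:
        return 0.0
    return 1.0 + abs(a.pitch - b.pitch) / 10.0


def select_units(targets: list[Target], corpus: list[Unit]) -> list[Unit]:
    """Return the unit sequence with the lowest total cost (Viterbi search)."""
    if not targets:
        return []
    # One column of candidate units per target diphone.
    columns = []
    for t in targets:
        candidates = [u for u in corpus if u.diphone == t.diphone]
        if not candidates:
            raise ValueError(f"no unit for diphone {t.diphone!r}")
        columns.append(candidates)

    # best[i][j] = (cheapest cost of a path ending in columns[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in columns[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in columns[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(columns[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(columns[i][j])
        j = best[i][j][1]
    path.reverse()
    return path
```

Calling `select_units(targets, corpus)` with a list of `Target` objects and a list of `Unit` objects returns one unit per target. The join cost illustrates why a large corpus with many instances of each diphone matters: the more candidates there are, the more likely it is that a sequence with good prosodic matches and smooth joins exists.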