CMU Blizzard 2007: A Hybrid Acoustic Unit Selection System from Statistically Predicted Parameters

This paper describes CMU's entry for the Blizzard Challenge 2007. Our final system was a hybrid: a statistical parameter generation system whose output was used to drive acoustic unit selection. Of the several systems we tested, this one performed best in our internal tests. This paper also explains some of the limitations we see in our techniques. The CMU system is identified as D in the result charts.

Given the larger speech databases, the participating teams (also referred to as sites) were asked to build speech synthesis systems A, B, and C within four weeks and then to synthesize a common set of sentences for perceptual evaluation. System A denotes the TTS system built from the whole database, B the TTS system built from the ARCTIC subset, and C the TTS system built from a site-defined subset. To avoid multiple submissions from a single site, each site was asked to submit only its best system for each of A, B, and C for comparison against those of other teams.

As part of the Blizzard Challenge, we wanted to investigate techniques for generating natural synthesis of consistent quality through acoustic unit selection from statistically predicted parameters. We built synthesis systems using the unit selection technique CLUNITS, the statistical parametric synthesis technique CLUSTERGEN, and a hybrid technique of unit selection from statistically predicted parameters. An internal evaluation of these systems showed that the hybrid system produced consistent and natural speech and was perceived to be better than the CLUNITS and CLUSTERGEN systems. The hybrid system was submitted as our final system for the A and B comparisons. The remainder of this paper describes the implementation and performance of the CLUNITS, CLUSTERGEN, and hybrid systems on the Blizzard datasets.
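The general idea of selecting acoustic units from statistically predicted parameters can be illustrated with a minimal sketch. It is not the implementation described in this paper: the function name `select_units`, the Euclidean target and join costs, and the `join_weight` parameter are all assumptions made for illustration. Each position in the utterance has a predicted parameter vector and a list of candidate unit vectors from the database, and a Viterbi search picks the sequence of candidates that lies close to the prediction while joining smoothly.

```python
from math import dist  # Euclidean distance, Python 3.8+


def select_units(predicted, candidates, join_weight=1.0):
    """Pick one candidate per position by Viterbi search.

    predicted:   list of predicted parameter vectors, one per position.
    candidates:  candidates[t] is the list of database unit vectors
                 available at position t (hypothetical representation).
    join_weight: assumed weight on the join cost between adjacent units.
    Returns the list of chosen candidate indices, one per position.
    """
    # Target cost of each candidate at the first position.
    prev_costs = [dist(predicted[0], c) for c in candidates[0]]
    back = []  # back[t-1][j] = best predecessor index for candidate j at t

    for t in range(1, len(predicted)):
        costs, ptrs = [], []
        for c in candidates[t]:
            # Cost of reaching c from each predecessor (path cost + join cost).
            reach = [
                prev_costs[i] + join_weight * dist(p, c)
                for i, p in enumerate(candidates[t - 1])
            ]
            best_i = min(range(len(reach)), key=reach.__getitem__)
            costs.append(reach[best_i] + dist(predicted[t], c))
            ptrs.append(best_i)
        back.append(ptrs)
        prev_costs = costs

    # Trace back from the cheapest final candidate.
    j = min(range(len(prev_costs)), key=prev_costs.__getitem__)
    path = [j]
    for ptrs in reversed(back):
        j = ptrs[j]
        path.append(j)
    path.reverse()
    return path


if __name__ == "__main__":
    predicted = [(0.0, 0.0), (1.0, 1.0)]
    candidates = [[(5.0, 5.0), (0.1, 0.0)], [(1.0, 1.1), (9.0, 9.0)]]
    print(select_units(predicted, candidates))
```

The sketch assumes at least one position and at least one candidate per position. A real system would use richer acoustic features and cost functions than plain Euclidean distance.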