WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications

A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. The new system offers both high sound quality and fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech, including consonants. Its processing speed was also compared with that of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real-time factor (RTF) indicated that it was fast enough for real-time processing.

key words: speech analysis, speech synthesis, vocoder, sound quality, real-time processing
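The real-time factor mentioned in the abstract is conventionally defined as the processing time divided by the duration of the processed audio, so a value below 1 means the system runs faster than real time. The following is a minimal sketch of that ratio under this conventional definition; the function names and the example timings are hypothetical and are not taken from the paper or from the WORLD implementation.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (conventional RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """A system keeps up with its input only if RTF is below 1."""
    return rtf < 1.0


# Hypothetical timings: 0.5 s of processing for 2.0 s of speech.
rtf = real_time_factor(0.5, 2.0)
print(rtf, is_real_time(rtf))  # 0.25 True
```

In practice the processing time would be measured around the analysis and synthesis calls (for example with `time.perf_counter`) and averaged over many utterances.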
