Implementation of realtime STRAIGHT speech manipulation system: Report on its first implementation

STRAIGHT, a very high quality speech analysis, modification, and synthesis system, has now been implemented in C and operates in realtime. This article first gives a brief summary of the STRAIGHT components and then introduces the underlying principles that enable realtime operation. The extended pitch-synchronous analysis built into STRAIGHT, which does not require analysis window alignment, plays an important role in the realtime implementation. The processing steps, which are based on the so-called “just-in-time” architecture, are described in detail. Other issues related to realtime implementation, as well as performance measures, are also discussed. The software will be available to researchers upon request.
