A linguistic and prosodic database for data-driven Japanese TTS synthesis

We propose a method to generate a database that contains a parametric representation of F0 contours associated with linguistic and acoustic information, to be used by data-driven Japanese text-to-speech (TTS) systems. The configuration of the database includes recorded speech, F0 contours and their parametric labels, phonetic transcription with durations, and other linguistic information such as orthographic transcription, part-of-speech (POS) tags, and accent types. All information that is not available by dictionary lookup is obtained automatically. In this paper, we propose a method to automatically obtain parametric labels that describe F0 contours based on a superpositional model. Preliminary tests on a small data set show that the method can find the parametric representation of F0 contours with acceptable accuracy, and that accuracy can be improved by intr oducing additional linguistic information.

[1]  Harald Singer,et al.  Automatic prosodic segmentation by F/sub 0/ clustering using superpositional modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[2]  Keikichi Hirose,et al.  Analysis of voice fundamental frequency contours for declarative sentences of Japanese , 1984 .

[3]  Yoshinori Sagisaka,et al.  Automatic Extraction of F 0 Control Rules Using Statistical Analysis , 1997 .

[4]  Keikichi Hirose,et al.  A scheme for pitch extraction of speech using autocorrelation function with frame length proportional to the time lag , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Keikichi Hirose,et al.  Detection of phrase boundaries in Japanese by low-pass filtering of fundamental frequency contours , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[6]  Mari Ostendorf,et al.  TOBI: a standard for labeling English prosody , 1992, ICSLP.