Automatic corpus-based tone and break-index prediction using K-ToBI representation

In this article we present a prosody generation architecture based on the K-ToBI (Korean Tone and Break Index) representation. ToBI is a multitier representation system, grounded in linguistic knowledge, for transcribing prosodic events in an utterance. TTS (Text-To-Speech) systems that adopt ToBI as an intermediate representation are known to exhibit higher flexibility, modularity, and domain/task portability than direct prosody generation TTS systems. However, reaching practical-level performance makes corpus preparation very expensive, because a ToBI-labeled corpus is constructed manually by many prosody experts and statistical prosody modeling normally requires large amounts of data. Unlike previous ToBI-based systems, this article proposes a new method that transcribes K-ToBI labels in Korean speech fully automatically. We develop automatic corpus-based K-ToBI labeling tools and prediction methods that use several lexico-syntactic linguistic features for decision-tree induction. We demonstrate F0 generation from automatically predicted K-ToBI labels and confirm that its performance is reasonably comparable to state-of-the-art direct prosody generation methods and previous ToBI-based methods.
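As a rough illustration of predicting break indices by decision-tree induction over lexico-syntactic features, the following minimal sketch builds a small entropy-based tree in plain Python. It is not the article's implementation: the splitting rule is a simple ID3-style criterion, and the feature names (`final_pos`, `punct`), the tag values, and the toy training rows are all hypothetical, invented only to make the example runnable.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, features):
    """Induce a decision tree over categorical features (ID3-style sketch)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no features remain; return a leaf label.
    if len(set(labels)) == 1 or not features:
        return majority

    def split_entropy(feature):
        # Weighted entropy of the labels after partitioning on this feature.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    # The feature with the lowest remaining entropy has the highest gain.
    best = min(features, key=split_entropy)
    remaining = [f for f in features if f != best]
    node = {"feature": best, "default": majority, "children": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node

def predict(tree, row):
    """Walk the tree; fall back to the node's majority label for unseen values."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree

# Hypothetical toy data: one row per word, labeled with a break index.
TOY_ROWS = [
    {"final_pos": "adnominal", "punct": "none"},
    {"final_pos": "case_particle", "punct": "none"},
    {"final_pos": "connective_ending", "punct": "comma"},
    {"final_pos": "final_ending", "punct": "period"},
    {"final_pos": "case_particle", "punct": "comma"},
    {"final_pos": "adnominal", "punct": "none"},
]
TOY_LABELS = ["1", "2", "3", "3", "3", "1"]

tree = build_tree(TOY_ROWS, TOY_LABELS, ["final_pos", "punct"])
print(predict(tree, {"final_pos": "case_particle", "punct": "none"}))
```

In a ToBI-based pipeline of the kind described above, labels predicted this way would then serve as input to a separate F0 generation step; that step is not shown here.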
