Using acoustics to improve pronunciation for synthesis of low resource languages

Some languages have very consistent mappings between graphemes and phonemes, while in others the mapping is more ambiguous. Consonantal writing systems are a challenge for text-to-speech (TTS) systems because they do not indicate short vowels, which makes pronunciation ambiguous. Even languages with an otherwise good correspondence between graphemes and phonemes may need special letter-to-sound rules for some cases. In a low-resource scenario, linguistic resources such as diacritizers or hand-written rules may not be available for the language. We propose a technique that automatically and iteratively learns pronunciations from acoustics during TTS training and predicts pronunciations from text at synthesis time. We conduct experiments on dialects of Arabic, to disambiguate homographs, and on Hindi, to discover schwa-deletion rules. We evaluate our systems using objective and subjective TTS metrics and show significant improvements for dialects of Arabic. Our methods can be generalized to other languages that exhibit similar phenomena.
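
The idea of choosing pronunciations from acoustics and then retraining can be illustrated with a toy loop. The sketch below is not the authors' system: it assumes one-dimensional "acoustic" frames, a single mean per phone as the acoustic model, and uniform segmentation in place of forced alignment. All names (`segment`, `score`, `learn_pronunciations`, the toy phones and frame values) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pron = Tuple[str, ...]                      # one candidate pronunciation
Token = Tuple[Sequence[float], List[Pron]]  # (acoustic frames, candidate pronunciations)


def segment(frames: Sequence[float], n: int) -> List[Sequence[float]]:
    """Split frames into n contiguous, nearly equal chunks.

    Assumption: a crude stand-in for forced alignment.
    """
    bounds = [round(i * len(frames) / n) for i in range(n + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(n)]


def score(frames: Sequence[float], pron: Pron, means: Dict[str, float]) -> float:
    """Negative squared error of the frames against the phone means (higher is better)."""
    total = 0.0
    for phone, chunk in zip(pron, segment(frames, len(pron))):
        total -= sum((x - means[phone]) ** 2 for x in chunk)
    return total


def learn_pronunciations(
    tokens: List[Token], init_means: Dict[str, float], iterations: int = 5
) -> Tuple[List[Pron], Dict[str, float]]:
    """Alternate between picking the best-scoring variant and re-estimating phone means."""
    means = dict(init_means)
    chosen: List[Pron] = []
    for _ in range(iterations):
        # Step 1: for each token, keep the candidate that best matches its acoustics.
        chosen = [
            max(cands, key=lambda p: score(frames, p, means))
            for frames, cands in tokens
        ]
        # Step 2: re-estimate each phone mean from the frames assigned to it.
        pooled: Dict[str, List[float]] = defaultdict(list)
        for (frames, _), pron in zip(tokens, chosen):
            for phone, chunk in zip(pron, segment(frames, len(pron))):
                pooled[phone].extend(chunk)
        for phone, values in pooled.items():
            if values:
                means[phone] = sum(values) / len(values)
    return chosen, means


# Toy data: two spoken tokens of the same written form "kt", with the short
# vowel missing from the orthography. Frame values are invented.
candidates: List[Pron] = [("k", "a", "t"), ("k", "i", "t"), ("k", "t")]
tokens: List[Token] = [
    ([1.1, 0.9, 5.2, 4.8, 2.1, 1.9], candidates),
    ([1.0, 1.2, 9.1, 8.9, 2.0, 2.2], candidates),
]
init_means = {"k": 1.0, "a": 4.0, "i": 8.0, "t": 2.0}

chosen, means = learn_pronunciations(tokens, init_means)
print(chosen)
print(means)
```

In a real TTS pipeline the scoring step would use trained acoustic models and proper alignment, and a separate text-based predictor would be trained on the chosen variants so that pronunciations can be predicted at synthesis time, when no audio is available.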
