On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model

Regional entities often occur in code-mixed text in non-native Roman script, and synthesizing them with the correct pronunciation and accent is a challenging problem. English grapheme-to-phoneme (G2P) rules fail for such entities because of orthographic errors and phonological differences between English and the regional languages. The traditional approach to this problem involves language identification, followed by transliteration of the regional entities into their native script, and then passing them through a native G2P. In this work, we simplify this module-based architecture by learning an end-to-end mixlingual G2P in a multi-task setting. Also, rather than mapping the output phone sequences from our mixlingual G2P to the English phoneset or using a “shared” phoneset, we use polyglot data and a “separated” phoneset to train a mixlingual synthesizer that improves the accent of the synthesized voice for regional entities. We use Hindi-English as the code-mixing scenario and show absolute gains of up to 28% in pronunciation accuracy and a gain of 0.9 in “overall impression” mean opinion score (MOS) over a standard English monolingual text-to-speech (TTS) system.
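For orientation, the traditional modular pipeline described above (language identification, then transliteration, then a native G2P) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the lookup tables stand in for trained models, the phone symbols are made up, and tagging each phone with a language prefix is only one assumed way to picture a “separated” phoneset. All function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

# All tables below are toy, hand-written assumptions for illustration only.
# English word -> English phones (stand-in for an English G2P).
ENGLISH_G2P: Dict[str, List[str]] = {
    "navigate": ["n", "ae", "v", "ah", "g", "ey", "t"],
    "to": ["t", "uw"],
}
# Romanized Hindi word -> Devanagari (stand-in for a transliteration model).
TRANSLITERATION: Dict[str, str] = {"chandni": "चांदनी", "chowk": "चौक"}
# Devanagari word -> Hindi phones (stand-in for a native Hindi G2P).
HINDI_G2P: Dict[str, List[str]] = {
    "चांदनी": ["ch", "aa", "n", "d", "n", "ii"],
    "चौक": ["ch", "au", "k"],
}


def identify_language(word: str) -> str:
    """Toy word-level language ID: lexicon lookup, defaulting to English."""
    return "hi" if word in TRANSLITERATION else "en"


def modular_g2p(text: str) -> List[Tuple[str, str, List[str]]]:
    """Run the three-stage pipeline word by word.

    Returns (word, language, phones) triples. Phones carry a language
    prefix so that Hindi and English phones never share a symbol.
    """
    result = []
    for word in text.lower().split():
        lang = identify_language(word)
        if lang == "hi":
            native = TRANSLITERATION[word]   # stage 2: back-transliterate
            phones = HINDI_G2P[native]       # stage 3: native G2P
        else:
            # Unknown English words fall back to their letters.
            phones = ENGLISH_G2P.get(word, list(word))
        result.append((word, lang, [f"{lang}_{p}" for p in phones]))
    return result


if __name__ == "__main__":
    for word, lang, phones in modular_g2p("Navigate to Chandni Chowk"):
        print(word, lang, " ".join(phones))
```

Each stage here is a separate component whose errors propagate to the next; the end-to-end mixlingual G2P proposed in this work replaces all three stages with a single learned model.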
