Text to speech in new languages without a standardized orthography

Many spoken languages do not have a standardized writing system. Building text-to-speech voices for them without accurate transcripts of the speech data is difficult. Our language-independent method for bootstrapping synthetic voices from speech data alone relies on cross-lingual phonetic decoding of speech. In this paper, we describe novel additions to our bootstrapping method. We present results on eight languages from different language families (English, Dari, Pashto, Iraqi, Thai, Konkani, Inupiaq, and Ojibwe) and show that our phonetic voices can be made understandable with as little as an hour of speech data that was never transcribed, and with few other resources available in the target language. We also present purely acoustic techniques that help induce syllable- and word-level information, which can further improve the intelligibility of these voices.
