TUNDRA: a multilingual corpus of found data for TTS research created with light supervision

Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments obtained with a lightly supervised process. Future versions of the corpus will include finer-grained alignment and prosodic annotation, all of which will be made freely available. This paper gives a general outline of the data collected so far, as well as a detailed description of how this has been done, emphasizing the minimal language-specific knowledge and manual intervention used to compile the corpus. To demonstrate its potential use, text-to-speech systems have been built for all languages using unsupervised or lightly supervised methods, which are also briefly presented in the paper.

Index Terms: multilingual corpus, light supervision, imperfect data, found data, text-to-speech, audiobook data
