Paper information - TUNDRA: a multilingual corpus of found data for TTS research created with light supervision

Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments obtained with a lightly-supervised process. Future versions of the corpus will include finer-grained alignment and prosodic annotation, all of which will be made freely available. This paper gives a general outline of the data collected so far, as well as a detailed description of how this has been done, emphasizing the minimal language-specific knowledge and manual intervention used to compile the corpus. To demonstrate its potential use, text-to-speech systems have been built for all languages using unsupervised or lightly supervised methods, which are also briefly presented in the paper.

Index Terms: multilingual corpus, light supervision, imperfect data, found data, text-to-speech, audiobook data
