The SWARA speech corpus: A large parallel Romanian read speech dataset

This paper introduces one of the largest Romanian speech datasets freely available for both academic and commercial use. The dataset comprises speech data recorded over the last year from 12 speakers, along with 5 other speakers previously recorded in a separate environment. The data was manually segmented at the utterance level and semi-automatically labelled at the phone level. The resulting corpus amounts to approximately 21 hours of high-quality read speech, split into over 19,000 utterances. Each speaker read between 921 and 1493 utterances. A set of 880 utterances is common to all speakers and adds up to over 16 hours of parallel data. We present the recording and segmentation procedure, as well as a first use of this corpus in the development of synthetic voices.
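The figures quoted in the abstract can be cross-checked with some quick arithmetic. The sketch below is illustrative only and is not part of the corpus tooling: the constant and function names are hypothetical, and the inputs are the rounded values from the abstract ("approximately 21 hours", "over 19,000 utterances", "over 16 hours"), so the derived averages are rough estimates rather than measured statistics.

```python
# Rounded figures taken from the abstract; names are illustrative, not an official API.
NEW_SPEAKERS = 12          # speakers recorded over the last year
EARLIER_SPEAKERS = 5       # speakers recorded previously in a separate environment
TOTAL_HOURS = 21.0         # "approximately 21 hours"
TOTAL_UTTERANCES = 19_000  # "over 19,000 utterances" (lower bound)
COMMON_UTTERANCES = 880    # utterances read by every speaker
PARALLEL_HOURS = 16.0      # "over 16 hours" (lower bound)


def corpus_summary():
    """Derive rough corpus statistics from the rounded abstract figures."""
    speakers = NEW_SPEAKERS + EARLIER_SPEAKERS
    # Every speaker reads the common set, so the parallel subset holds
    # one recording of each common utterance per speaker.
    parallel_utterances = COMMON_UTTERANCES * speakers
    return {
        "speakers": speakers,
        "parallel_utterances": parallel_utterances,
        "mean_utterance_seconds": TOTAL_HOURS * 3600 / TOTAL_UTTERANCES,
        "mean_parallel_utterance_seconds": PARALLEL_HOURS * 3600 / parallel_utterances,
        "parallel_share_of_corpus": PARALLEL_HOURS / TOTAL_HOURS,
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these assumptions the parallel subset contains 880 recordings per speaker across 17 speakers, and an average utterance lasts roughly four seconds, which is in line with short read sentences.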
