Automatic Speech Corpus Construction from Broadcasting Speech Databases

Diversified speech synthesis often requires new speech corpora to be constructed frequently. This paper describes our work on automatically constructing a speech corpus from broadcasting speech databases for a trainable Text-To-Speech (TTS) system, and presents a new framework for doing so. First, a music detector based on speech/music discrimination selects the clean speech audio from the broadcast audio. Next, an automatic speech sentence segmentation system generates a sentence database from the clean speech audio. Finally, a text corpus construction method selects the spoken sentences that maximize coverage of the diphones in the sentence database. Experiments show that our method can generate a good speech corpus rapidly with minimal manual intervention.
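The final selection step can be illustrated with a minimal sketch. The abstract does not state which algorithm is used, so the greedy set-cover heuristic below is an assumption, as are the names `diphones` and `select_sentences`, the `budget` parameter, and the representation of each sentence as a list of phone labels.

```python
from typing import Dict, List, Set, Tuple

Diphone = Tuple[str, str]


def diphones(phones: List[str]) -> Set[Diphone]:
    """Return the set of adjacent phone pairs (diphones) in one sentence."""
    return set(zip(phones, phones[1:]))


def select_sentences(db: Dict[str, List[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` sentence ids that add the most new diphones.

    Assumption: greedy set cover stands in for the paper's unspecified
    coverage-maximizing method. `db` maps a sentence id to its phone sequence.
    """
    remaining = {sid: diphones(phones) for sid, phones in db.items()}
    covered: Set[Diphone] = set()
    chosen: List[str] = []
    while remaining and len(chosen) < budget:
        # Iterate ids in sorted order so ties resolve deterministically.
        best = max(sorted(remaining), key=lambda sid: len(remaining[sid] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence contributes a new diphone
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen
```

Selection stops early once no remaining sentence adds an uncovered diphone, so the returned list may be shorter than `budget`.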
