TUKE-BNews-SK: Slovak Broadcast News Corpus Construction and Evaluation

This article gives an overview of existing acoustic corpora suitable for automatic broadcast news transcription in the Slovak language. The TUKE-BNews-SK database, created in our department, was built to support the development of applications for automatic broadcast news processing and spontaneous speech recognition in Slovak. The audio corpus comprises 479 broadcast news shows from the public Slovak television channel STV1 (“Jednotka”), containing 265 hours of material and 186 hours of clean transcribed speech (a 4-hour subset was extracted for testing purposes). The recordings were manually transcribed using the Transcriber tool, modified for Slovak annotators and extended with automatic Slovak spell checking. The corpus design, acquisition, annotation scheme, and pronunciation transcription are described, together with corpus statistics and the tools used. Finally, the evaluation procedure using automatic speech recognition is presented on broadcast news and parliamentary speech test sets.
