TEDxSK and JumpSK: A New Slovak Speech Recognition Dedicated Corpus

Abstract This paper describes a new Slovak corpus dedicated to speech recognition, built from TEDx talks and Jump Slovakia lectures. The proposed speech database consists of 220 talks and lectures with a total duration of about 58 hours. The annotated speech database was generated automatically in an unsupervised manner, using acoustic speech segmentation based on principal component analysis and automatic speech transcription with two complementary speech recognition systems. Evaluation data consisting of 50 manually annotated talks and lectures, with a total duration of about 12 hours, were created to evaluate the quality of Slovak speech recognition. Through unsupervised automatic annotation of the TEDx talks and Jump Slovakia lectures, we obtained 21.26% of new speech segments with a word error rate of approximately 9.44%, suitable for retraining or adapting previously trained acoustic models.
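Since the quality figure above is reported as a word error rate, the following minimal sketch shows how that metric is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the scoring tool actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and not necessarily what the authors' scoring used.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For example, scoring the hypothesis `"a x c"` against the reference `"a b c d"` counts one substitution and one deletion over four reference words. Because insertions are counted as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.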
