An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation

In this paper, we introduce an extension of our previously released TUKE-BNews-SK corpus based on a semi-automatic annotation scheme. The scheme first relies on automatic transcription of the broadcast news (BN) data by our Slovak large vocabulary continuous speech recognition system. The generated hypotheses are then manually corrected and completed by trained human annotators. The corpus is composed of 25 hours of fully annotated spontaneous and prepared speech. In addition, we have acquired a further 900 hours of BN data, part of which we plan to annotate semi-automatically. We present a preliminary evaluation of the corpus, which gives promising results.
