Slovak Broadcast News Speech Recognition and Transcription System

We have developed a working prototype of an automatic subtitling system for the transcription, archiving, and indexing of Slovak audiovisual recordings such as lectures, talks, discussions, and broadcast news. To advance its development and the underlying research, we had to incorporate more modern speech technologies and adopt current deep learning techniques. This paper describes the transition and the changes made to the working prototype: replacement of the speech recognition core, changes to the architecture, and a new web-based user interface. We used the state-of-the-art Kaldi speech recognition toolkit and a distributed architecture to make the interface more responsive and to process audiovisual recordings faster. Using acoustic models based on time delay deep neural networks, we lowered the system's average word error rate from the previously reported 24% to 15%, an absolute reduction of 9 percentage points.
