Large Corpus of Czech Parliament Plenary Hearings

We present a large corpus of Czech parliament plenary sessions. The corpus consists of approximately 1200 hours of speech data and corresponding text transcriptions. The whole corpus has been segmented into short audio segments, making it suitable for both training and evaluation of automatic speech recognition (ASR) systems. The source language of the corpus is Czech, which makes it a valuable resource for future research, as only a few public datasets are available in the Czech language. We complement the data release with experiments on two baseline ASR systems trained on the presented data: the more traditional approach implemented in the Kaldi ASR toolkit, which combines hidden Markov models and deep neural networks (NNs), and a modern ASR architecture implemented in the Jasper toolkit, which uses deep NNs in an end-to-end fashion.
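Since the corpus is segmented into short utterances and one baseline uses Kaldi, a release of this kind is typically distributed in Kaldi's standard data-directory layout, where a `segments` file maps each utterance to a recording and a time span. The sketch below shows how such a file could be parsed and the total speech duration computed; the file format follows Kaldi's documented conventions, but the example utterance and recording ids are purely illustrative, not taken from this corpus.

```python
import io

def parse_segments(f):
    """Parse a Kaldi-style 'segments' file.

    Each non-empty line has the form:
        <utterance-id> <recording-id> <start-seconds> <end-seconds>
    Returns a list of (utt_id, rec_id, start, end) tuples.
    """
    segs = []
    for line in f:
        line = line.strip()
        if not line:
            continue
        utt, rec, start, end = line.split()
        segs.append((utt, rec, float(start), float(end)))
    return segs

def total_hours(segments):
    """Total speech duration in hours across all segments."""
    return sum(end - start for _, _, start, end in segments) / 3600.0

# Hypothetical two-utterance excerpt in the assumed layout.
example = io.StringIO(
    "sess001-0001 sess001 0.00 7.45\n"
    "sess001-0002 sess001 7.45 13.10\n"
)
segs = parse_segments(example)
print(len(segs), total_hours(segs))
```

Summing segment durations this way is also how one would verify that a local copy of the release adds up to the stated ~1200 hours.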
