CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition

Recent end-to-end automatic speech recognition (ASR) systems have demonstrated the ability to outperform conventional hybrid DNN/HMM ASR. Aside from architectural improvements, these models have grown in depth, parameter count, and model capacity. However, they also require more training data to achieve comparable performance.
