TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation

In this paper, we present TED-LIUM release 3 corpus (TED-LIUM 3 is available on https://lium.univ-lemans.fr/ted-lium3/) dedicated to speech recognition in English, which multiplies the available data to train acoustic models in comparison with TED-LIUM 2, by a factor of more than two. We present the recent development on Automatic Speech Recognition (ASR) systems in comparison with the two previous releases of the TED-LIUM Corpus from 2012 and 2014. We demonstrate that, passing from 207 to 452 h of transcribed speech training data is really more useful for end-to-end ASR systems than for HMM-based state-of-the-art ones. This is the case even if the HMM-based ASR system still outperforms the end-to-end ASR system when the size of audio training data is 452 h, with a Word Error Rate (WER) of 6.7% and 13.7%, respectively. Finally, we propose two repartitions of the TED-LIUM release 3 corpus: the legacy repartition that is the same as that existing in release 2, and a new repartition, calibrated and designed to make experiments on speaker adaptation. Similar to the two first releases, TED-LIUM 3 corpus will be freely available for the research community.

[1]  Yiming Wang,et al.  Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs , 2018, IEEE Signal Processing Letters.

[2]  Chong Wang,et al.  Lookahead Convolution Layer for Unidirectional Recurrent Neural Networks , 2016 .

[3]  Daniel Jurafsky,et al.  First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs , 2014, ArXiv.

[4]  Paul Deléglise,et al.  Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks , 2014, LREC.

[5]  Jian Wang,et al.  Neural Network Language Modeling with Letter-Based Features and Importance Sampling , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Yiming Wang,et al.  Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.

[7]  Yiming Wang,et al.  A Pruned Rnnlm Lattice-Rescoring Algorithm for Automatic Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Yiming Wang,et al.  Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks , 2018, INTERSPEECH.

[9]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[10]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[11]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[12]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[13]  Paul Deléglise,et al.  TED-LIUM: an Automatic Speech Recognition dedicated corpus , 2012, LREC.

[14]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[15]  Sanjeev Khudanpur,et al.  Audio augmentation for speech recognition , 2015, INTERSPEECH.

[16]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[17]  Natalia A. Tomashenko,et al.  Speaker adaptation of context dependent deep neural networks based on MAP-adaptation and GMM-derived feature processing , 2014, INTERSPEECH.