SpeeD's DNN approach to Romanian speech recognition

This paper presents the main improvements recently brought to the large-vocabulary continuous speech recognition (LVCSR) system for the Romanian language developed by the Speech and Dialogue (SpeeD) research laboratory. The most important improvement is the replacement of the classic HMM-GMM acoustic models with DNN-based ones, but several other aspects are discussed as well: a significant extension of the speech training corpus, additional feature-processing algorithms, speaker adaptive training, discriminative training, and lattice rescoring with significantly expanded language models (n-gram models up to order 5, built on vocabularies of up to 200k words). The ASR experiments were performed with several types of acoustic and language models in different configurations on the standard read and conversational speech corpora created by SpeeD in 2014. The results show that extending the training speech corpus yields a relative word error rate (WER) improvement of 15% to 17%, while replacing the HMM-GMM acoustic models with DNN-based ones yields a relative WER improvement of 18% to 23%, depending on the nature of the evaluation speech corpus (read or conversational, clean or noisy). The best configuration of the LVCSR system was integrated into a live transcription web application, available on the SpeeD laboratory's website at https://speed.pub.ro/live-transcriber-2017.
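The gains above are stated as *relative* WER improvements, i.e. the fraction of the baseline's errors that the new system removes, not a difference in absolute WER points. A minimal sketch of that calculation, using hypothetical WER values (the paper reports only the relative figures, not the underlying absolute WERs used here):

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's word errors eliminated by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical illustration: an HMM-GMM baseline at 20.0% WER and a
# DNN-based system at 16.0% WER give a 20% relative improvement,
# which falls inside the 18-23% range reported in the paper.
print(relative_wer_improvement(20.0, 16.0))  # → 0.2
```

Note that a 4-point absolute drop (20% to 16%) corresponds to a 20% relative improvement; the two ways of reporting gains should not be conflated when comparing systems.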
