DNN Online with iVectors Acoustic Modeling and Doc2Vec Distributed Representations for Improving Automated Speech Scoring

Applying automated speech-scoring technology to the rating of globally administered, real-world assessments raises several practical challenges: (a) ASR accuracy on non-native spontaneous speech is generally low; (b) because of the data mismatch between an ASR system's training stage and its final usage, the recognition accuracy obtained in practice is even lower; (c) content relevance has not been widely used in operational scoring models because of various technical and logistical issues. In this paper, we first trained an ASR system using a multi-splice deep neural network (DNN) architecture with iVectors, which achieved a word error rate (WER) of 19.1%. Second, for prompts not covered in ASR training, we applied language model (LM) adaptation using spoken responses acquired from previous operational tests, which reduced WER by more than 8% relative. The improved ASR accuracy raises scoring performance without any extra human annotation cost. Finally, the resulting ASR system allowed us to apply content features in practice. In addition to the conventional frequency-based approach, content vector analysis (CVA), we explored distributed representations with Doc2Vec and found that they improved content measurement.
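The two ASR figures quoted above are a word error rate and a relative WER reduction. The following minimal Python sketch shows how such quantities are commonly computed, assuming the standard word-level edit-distance definition of WER. The function names `word_error_rate` and `relative_reduction`, the example sentences, and the numbers 0.25 and 0.23 are illustrative assumptions, not the paper's tooling or results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes the usual WER definition:
    (substitutions + deletions + insertions) / reference length.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical WERs before and after LM adaptation (not values from the paper).
gain = relative_reduction(0.25, 0.23)
```

A relative reduction is measured against the baseline rather than in absolute percentage points, so in this hypothetical case a drop from 25% to 23% WER is a relative reduction of about 8%, even though the absolute change is only 2 points.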
