Using exemplar responses for training and evaluating automated speech scoring systems

Automated scoring engines are usually trained and evaluated against human scores and compared to the benchmark of human-human agreement. In this paper we compare the performance of an automated speech scoring engine using two corpora: a corpus of almost 700,000 randomly sampled spoken responses with scores assigned by one or two raters during operational scoring, and a corpus of 16,500 exemplar responses with scores reviewed by multiple expert raters. We show that the choice of corpus used for model evaluation has a major effect on estimates of system performance, with r varying between 0.64 and 0.80. Surprisingly, this is not the case for the choice of training corpus: when the training corpus is sufficiently large, systems trained on different corpora show almost identical performance when evaluated on the same corpus. We show that this effect is consistent across several learning algorithms. We conclude that evaluating the model on a corpus of exemplar responses, if one is available, provides additional evidence about system validity; at the same time, investing effort into creating a corpus of exemplar responses for model training is unlikely to lead to a substantial gain in model performance.
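
As a minimal illustration of the evaluation setup described above, the sketch below computes agreement (Pearson's r) between automated and human scores separately for an operationally scored corpus and an exemplar corpus, which is how differences such as the reported 0.64 vs. 0.80 would surface. The file names and column names (operational_scores.csv, exemplar_scores.csv, human_score, machine_score) are hypothetical placeholders, not the actual data or pipeline from the paper.

```python
# Illustrative sketch only: file and column names are hypothetical placeholders,
# not the corpora used in the paper.
import pandas as pd
from scipy.stats import pearsonr


def corpus_agreement(path):
    """Return Pearson's r between human and automated scores for one corpus."""
    df = pd.read_csv(path)
    r, _ = pearsonr(df["human_score"], df["machine_score"])
    return r


# The same scoring model is evaluated against two differently scored corpora.
for name, path in [("operational", "operational_scores.csv"),
                   ("exemplar", "exemplar_scores.csv")]:
    print(f"{name}: r = {corpus_agreement(path):.2f}")
```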
