Automatic scoring of non-native spontaneous speech in tests of spoken English

This paper presents the first version of the SpeechRater℠ system for automatically scoring non-native spontaneous high-entropy speech in the context of an online practice test for prospective takers of the Test of English as a Foreign Language® internet-based test (TOEFL® iBT). The system consists of three components: a speech recognizer trained on non-native English speech data; a feature computation module that uses the recognizer output to compute a set of mostly fluency-based features; and a multiple regression scoring model that predicts a speaking proficiency score for every test item response from a subset of those features. Experiments with classification and regression trees (CART) complement those performed with multiple regression. We evaluate the system both on TOEFL Practice Online (TPO) data and on Field Study data collected before the introduction of the TOEFL iBT. Features are selected by test development experts based both on their empirical correlations with human scores and on their coverage of the concept of communicative competence. We conclude that while the correlation between machine scores and human scores on TPO (0.57) is still 0.17 below the inter-human correlation (0.74) on complete sets of six items (Pearson r correlation coefficients), it is high enough to warrant deploying the system in a low-stakes practice environment, given its coverage of several important aspects of communicative competence such as fluency, vocabulary diversity, grammar, and pronunciation.
A further reason for deploying the system in a low-stakes practice environment is that it is the initial version of a long-term research and development program: features related to vocabulary, grammar, and content will be added at a later stage, when automatic speech recognition performance improves, and this can be done without redesigning the system. Exact agreement on single TPO items between our system and human scores was 57.8%, essentially on par with the inter-human agreement of 57.2%. Our system has been in operational use to score TOEFL Practice Online Speaking tests since the fall of 2006 and has since scored tens of thousands of tests.
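
The two evaluation measures reported above, Pearson r and exact agreement, can be illustrated with a minimal sketch. It is not the system's evaluation code. The function names, the integer score scale, and the example scores are hypothetical, and rounding continuous machine scores to the nearest integer before comparing them with human scores is an assumption made here for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def exact_agreement(machine_scores, human_scores):
    """Fraction of items where the rounded machine score equals the human score.

    Assumption: machine scores are continuous regression outputs and are
    rounded to the nearest integer score point before comparison.
    """
    if len(machine_scores) != len(human_scores) or not human_scores:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(
        1 for m, h in zip(machine_scores, human_scores) if round(m) == h
    )
    return matches / len(human_scores)


if __name__ == "__main__":
    # Hypothetical item-level scores, invented for illustration only.
    human = [1, 2, 3, 4, 2, 3]
    machine = [1.2, 2.4, 2.6, 3.7, 2.9, 3.1]
    print("Pearson r:", pearson_r(machine, human))
    print("Exact agreement:", exact_agreement(machine, human))
```

The same two functions applied to a pair of human raters instead of a machine and a human would give the inter-human figures that the machine results are compared against.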
