Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English

ABSTRACT Human raters and machine scoring systems potentially have complementary strengths in evaluating language ability; specifically, it has been suggested that automated systems might be used to make consistent measurements of specific linguistic phenomena, whilst humans evaluate more global aspects of performance. We report on an empirical study that explored the possibility of combining human and machine scores using responses from the speaking section of the TOEFL iBT® test. Human raters awarded scores for three sub-constructs: delivery, language use, and topic development. The SpeechRater℠ automated scoring system produced scores for delivery and language use. Composite scores computed from three different combinations of human and automated analytic scores were equally or more reliable than human holistic scores, probably due to the inclusion of multiple observations in composite scores. However, composite scores calculated solely from human analytic scores showed the highest reliability, and reliability steadily decreased as more machine scores replaced human scores.
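The composite-scoring idea described above can be sketched minimally as follows. This is an illustration only: the study's actual weighting scheme, score scales, and values are not given here, so the equal weights, function names, and sample scores below are all hypothetical.

```python
def composite_score(delivery, language_use, topic_development,
                    weights=(1/3, 1/3, 1/3)):
    """Weighted composite of three analytic sub-scores (equal weights assumed)."""
    w_d, w_l, w_t = weights
    return w_d * delivery + w_l * language_use + w_t * topic_development

# Hypothetical sub-scores on a 1-4 scale.
human = {"delivery": 3.0, "language_use": 3.5, "topic_development": 4.0}
# SpeechRater-style machine scores cover only delivery and language use;
# topic development remains human-scored in every hybrid configuration.
machine = {"delivery": 2.8, "language_use": 3.2}

# All-human composite vs. a hybrid where a machine score replaces one human score.
all_human = composite_score(human["delivery"], human["language_use"],
                            human["topic_development"])
hybrid = composite_score(machine["delivery"], human["language_use"],
                         human["topic_development"])
```

With equal weights, each machine substitution simply swaps one observation in the average, which is the sense in which the configurations in the study vary from all-human to increasingly machine-scored composites.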
