Machine Learning–Driven Language Assessment

We describe a method for rapidly creating language proficiency assessments, and provide experimental evidence that such tests can be valid, reliable, and secure. Our approach is the first to use machine learning and natural language processing to induce proficiency scales based on a given standard, and then use linguistic models to estimate item difficulty directly for computer-adaptive testing. This alleviates the need for expensive pilot testing with human subjects. We used these methods to develop an online proficiency exam called the Duolingo English Test, and demonstrate that its scores align significantly with other high-stakes English assessments. Furthermore, our approach produces test scores that are highly reliable, while generating item banks large enough to satisfy security requirements.
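To make the idea of feeding estimated item difficulties into a computer-adaptive test concrete, here is a minimal sketch. It assumes a one-parameter (Rasch-style) response model, maximum-information item selection, and a grid-based posterior-mean ability update; the abstract does not specify any of these details. The difficulty values are hypothetical placeholders standing in for what a linguistic model might predict, and every function name is illustrative.

```python
import math

def p_correct(theta, delta):
    """Rasch-style probability that ability theta answers an item of difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def item_information(theta, delta):
    """Fisher information of a Rasch item at ability theta: p * (1 - p)."""
    p = p_correct(theta, delta)
    return p * (1.0 - p)

def next_item(theta, difficulties, administered):
    """Pick the unused item that is most informative at the current ability estimate."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))

def eap_estimate(responses, difficulties, grid=None):
    """Posterior-mean ability under a standard normal prior, evaluated on a grid.

    responses is a list of (item_index, answered_correctly) pairs.
    """
    if grid is None:
        grid = [x / 10.0 for x in range(-40, 41)]  # abilities from -4.0 to 4.0
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for i, correct in responses:
            p = p_correct(theta, difficulties[i])
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

# Hypothetical difficulties, assumed to come from a text-based difficulty model
# rather than from pilot testing with human subjects.
predicted_difficulties = [-2.0, 0.1, 1.5]

theta = 0.0
administered = set()
responses = []
first = next_item(theta, predicted_difficulties, administered)
administered.add(first)
responses.append((first, True))  # pretend the test taker answered correctly
theta = eap_estimate(responses, predicted_difficulties)
second = next_item(theta, predicted_difficulties, administered)
```

Under a Rasch model, the most informative item is the one whose difficulty lies closest to the current ability estimate, so the loop naturally moves toward harder items after correct answers and easier ones after mistakes.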
