Jump-Starting Item Parameters for Adaptive Language Tests

A challenge in designing high-stakes language assessments is calibrating item difficulties, either a priori or from limited pilot test data. While prior work has addressed 'cold start' estimation of item difficulties without piloting, we devise a multi-task generalized linear model with BERT features to jump-start these estimates, rapidly improving their quality with as few as 500 test-takers and a small number of exposures per item (≈6 each) from a large item bank (≈4,000 items). Our joint model provides a principled way to compare test-taker proficiency, item difficulty, and language proficiency frameworks like the Common European Framework of Reference (CEFR). It also yields difficulty estimates for new items without piloting them first, which limits item exposure and thus enhances test security. Finally, using operational data from the Duolingo English Test, a high-stakes English proficiency test, we find that difficulty estimates derived with this method correlate strongly with lexico-grammatical features associated with reading complexity.
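The core idea can be illustrated with a minimal sketch: a Rasch-style model in which each item's difficulty is a linear function of a per-item feature vector (standing in for the paper's BERT features), fit jointly with test-taker abilities by gradient ascent. All names, dimensions, and learning rates below are hypothetical choices for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: each item has a feature vector (a stand-in for
# its BERT embedding), and its true difficulty is a linear function
# of those features, as in a linear-logistic test model.
n_items, n_people, dim = 200, 300, 8
X = rng.normal(size=(n_items, dim))        # per-item "BERT" features
w_true = rng.normal(size=dim)
b_true = X @ w_true                        # true item difficulties
theta_true = rng.normal(size=n_people)     # true test-taker abilities

# Sparse exposure: each person answers only a handful of random items,
# mirroring the paper's small-exposure regime.
person = rng.integers(0, n_people, size=6 * n_people)
item = rng.integers(0, n_items, size=6 * n_people)
p = 1.0 / (1.0 + np.exp(-(theta_true[person] - b_true[item])))
y = (rng.random(p.shape) < p).astype(float)  # 1 = correct response

# Jointly fit abilities theta and feature weights w by gradient ascent
# on the Rasch log-likelihood P(correct) = sigmoid(theta - x @ w).
# Because difficulty is tied to features, unseen items inherit an
# estimate X @ w without ever being piloted.
theta = np.zeros(n_people)
w = np.zeros(dim)
lr_theta, lr_w = 0.1, 0.002
for _ in range(2000):
    resid = y - 1.0 / (1.0 + np.exp(-(theta[person] - (X @ w)[item])))
    g_theta = np.bincount(person, weights=resid, minlength=n_people)
    g_b = np.bincount(item, weights=resid, minlength=n_items)
    theta += lr_theta * g_theta
    w -= lr_w * (X.T @ g_b)   # difficulty enters with a minus sign
    theta -= theta.mean()     # fix the model's location indeterminacy

# Recovered difficulties should track the true ones.
corr = np.corrcoef(X @ w, b_true)[0, 1]
```

The same fitted weights `w` would then score any new item from its features alone, which is the "jump-start" the abstract describes; the real model is a multi-task generalized linear model over richer response types, not this binary toy.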
