The NAEP EDM Competition: Theory-Driven Psychometrics and Machine Learning for Predictions Based on Log Data

The 2nd Annual WPI-UMASS-UPENN EDM Data Mining Challenge required contestants to predict efficient test-taking based on log data. In this paper, we describe our theory-driven, psychometric modeling approach. For feature engineering, we employed the Log-Normal Response Time Model to estimate latent person speed and the Generalized Partial Credit Model to estimate latent person ability. Additionally, we adopted an n-gram feature approach for event sequences. For training a multi-label classifier, we distinguished inefficient test takers who were going too fast from those who were going too slow, instead of using the provided binary target label. Our best-performing ensemble classifier comprised three sets of low-dimensional classifiers, dominated by test-taker speed. While our classifier reached only moderate performance relative to the competition leaderboard, our approach makes two important contributions. First, we show how explainable classifiers can provide meaningful predictions whose results can be contextualized for test administrators who wish to intervene or take action. Second, our re-engineering of test scores enabled us to incorporate person ability into the estimation. However, ability was hardly predictive of efficient behavior, leading to the conclusion that the validity of the target label needs to be questioned. The paper concludes with tools that are helpful for substantively meaningful log data mining.
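To illustrate the n-gram feature approach for event sequences mentioned above, the following is a minimal sketch of turning a log-event sequence into bigram count features. This is not the authors' implementation (the paper's toolchain was R); the event names, the sequence, and the `ngram_features` helper are hypothetical.

```python
from collections import Counter

def ngram_features(events, n=2):
    """Count contiguous n-grams (default: bigrams) in a list of event labels."""
    grams = zip(*(events[i:] for i in range(n)))
    return Counter("_".join(gram) for gram in grams)

# Hypothetical log-event sequence for one test taker and one item
seq = ["Enter", "Click", "Click", "Next"]
feats = ngram_features(seq)
# feats now maps each observed event pair (e.g. "Click_Click") to its count;
# stacking such vectors over test takers yields a sparse feature matrix.
```

In practice, such counts would be computed per test taker (and possibly per item), aligned into a common feature space, and fed to the classifier alongside the psychometric speed and ability estimates.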
