Developments in Psychometric Population Models for Technology-Based Large-Scale Assessments: An Overview of Challenges and Opportunities

International large-scale assessments (ILSAs) have transitioned from paper-based assessments to computer-based assessments (CBAs), facilitating the use of new item types and more effective data collection tools. This transition allows the implementation of more complex test designs and the collection of process and response time (RT) data. These new data types can be used to improve data quality and the accuracy of test scores obtained through latent regression (population) models. However, the move to a CBA also poses challenges for comparability and trend measurement, one of the major goals in ILSAs. We provide an overview of current methods used in ILSAs to examine and assure the comparability of data across different assessment modes, as well as methods that improve the accuracy of test scores by making use of new data types provided by a CBA.
