The Prevalence and Implications of Slipping on Low-Stakes, Large-Scale Assessments

In the absence of clear incentives, achievement tests may be subject to slipping, in which item response functions have upper asymptotes below one. Slipping reduces score precision for examinees with higher latent scores and distorts test developers' understanding of item and test information. A multidimensional four-parameter normal ogive model was developed for large-scale assessments and applied to the dichotomous items of the 2011 National Assessment of Educational Progress (NAEP) eighth-grade mathematics and reading tests. The results suggest that the probability of slipping exceeded 5% for 47.2% of the dichotomous mathematics items and 51.1% of the reading items. Furthermore, allowing for slipping yielded larger item discrimination parameters, increased information in the lower-to-middle range of the latent trait, and decreased precision for scores one standard deviation above the mean. These results provide evidence that slipping should be considered during test development and construction to ensure adequate measurement across the latent continuum.
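
For context, the following is a minimal sketch of a four-parameter normal ogive item response function, assuming one common unidimensional parameterization in which c is the lower asymptote (guessing) and d the upper asymptote (slipping); the function name p_4pno and the example item parameters are illustrative only, and the paper's multidimensional model and its Bayesian estimation are not reproduced here.

```python
from scipy.stats import norm

def p_4pno(theta, a, b, c, d):
    """Four-parameter normal ogive item response function.

    theta: latent trait score
    a: discrimination; b: difficulty-related intercept
    c: lower asymptote (guessing); d: upper asymptote (slipping)

    With d < 1, even examinees with very high theta retain a
    probability of (1 - d) of answering the item incorrectly.
    """
    return c + (d - c) * norm.cdf(a * theta - b)

# Illustration: an item with a 10% slipping probability (d = 0.9).
# As theta increases, P approaches 0.9 rather than 1.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_4pno(theta, a=1.2, b=0.0, c=0.2, d=0.9), 3))
```

Because the curve plateaus at d rather than at 1, the item information function falls off more quickly in the upper tail of the latent trait, which is consistent with the precision loss for high-scoring examinees described above.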
