Effectiveness of Item Response Theory (IRT) Proficiency Estimation Methods under Adaptive Multistage Testing. Research Report. ETS RR-15-11.

The purpose of this inquiry was to investigate the effectiveness of item response theory (IRT) proficiency estimators in terms of estimation bias and error under multistage testing (MST). We chose a two-stage MST design in which a single adaptation to the examinees' ability levels takes place. It includes 4 modules (1 at Stage 1, 3 at Stage 2) and 3 paths (low, middle, and high). When creating the two-stage MST panels (i.e., forms), we manipulated 2 assembly conditions in each module, namely difficulty level and module length, to examine whether any interaction existed between IRT estimation methods and MST panel designs. For each panel, we compared the accuracy of examinees' proficiency estimates derived from 7 IRT proficiency estimators. We found that the choice between Bayesian (prior) and non-Bayesian (no prior) estimators was of more practical significance than the choice between number-correct and item-pattern scoring. At the extreme proficiency levels, the decrease in standard error compensated for the increase in bias in the Bayesian estimates, resulting in smaller total error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for examinees at the extreme proficiency levels. The impact of misrouting at Stage 1 was minimal under the MST design used in this study.
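The Bayesian-versus-non-Bayesian contrast described above can be illustrated with a minimal sketch. The following Python example assumes a 2PL IRT model with illustrative item parameters (not the operational pool from the report) and compares a grid-search maximum likelihood estimate (no prior) with an expected a posteriori (EAP) estimate under a standard normal prior. For a perfect response pattern, the MLE runs off toward the edge of the search grid, while the EAP estimate is pulled toward the prior mean, which is the bias/standard-error trade-off the abstract describes.

```python
import numpy as np

# Hypothetical 2PL item parameters (a = discrimination, b = difficulty);
# illustrative values only, not the item pool used in the report.
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0, 1.4, 0.7])
b = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, -0.8, 0.8, 0.2])

def p_correct(theta):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def log_lik(theta, resp):
    """Log-likelihood of a 0/1 response vector at ability theta."""
    p = p_correct(theta)
    return np.sum(resp * np.log(p) + (1 - resp) * np.log(1 - p))

grid = np.linspace(-6, 6, 1201)  # shared search / quadrature grid

def mle(resp):
    """Maximum likelihood estimate by grid search (no prior)."""
    ll = np.array([log_lik(t, resp) for t in grid])
    return float(grid[np.argmax(ll)])

def eap(resp):
    """Expected a posteriori estimate with a standard normal prior."""
    ll = np.array([log_lik(t, resp) for t in grid])
    post = np.exp(ll - ll.max()) * np.exp(-grid**2 / 2)  # likelihood x N(0,1)
    post /= post.sum()
    return float(np.sum(grid * post))

# A perfect response pattern: the likelihood increases without bound in
# theta, so the MLE hits the edge of the grid, while the EAP estimate
# stays finite, shrunk toward the prior mean of 0.
perfect = np.ones(10, dtype=int)
print(f"MLE: {mle(perfect):.2f}, EAP: {eap(perfect):.2f}")
```

The same shrinkage mechanism applies at the low extreme (an all-incorrect pattern), which is why the report finds the prior/no-prior choice most consequential for examinees at extreme proficiency levels.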