Evaluating Comparability in Computerized Adaptive Testing: Issues, Criteria and an Example

When a computerized adaptive testing (CAT) version of a test coexists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from the P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to one another. In this paper, we review the research literature on CAT comparability and synthesize the issues specific to these two settings. We develop a framework of criteria for evaluating comparability with three categories: a validity criterion, a psychometric property/reliability criterion, and a statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria, as well as various algorithms for improving comparability, are described and discussed. Focusing on the psychometric property/reliability criterion, we provide an example using an item pool of ACT Assessment Mathematics items to demonstrate a process for developing comparable CAT versions and for evaluating comparability. The example illustrates how simulations can be used to improve comparability in the early stages of developing a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding is that a large part of the incomparability may be due to the change from number-correct scoring to scoring based on IRT ability estimates. In addition, changes in CAT components, such as exposure rate control, content balancing, test length, and item pool size, were found to result in different levels of score comparability.
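The comparability evaluation described above rests on simulations of this kind. As a rough sketch only, and not the authors' implementation, the following Python code simulates a fixed-length CAT with a hypothetical 3PL item pool, maximum-information item selection, and EAP ability estimation, and contrasts the resulting IRT-based scores with number-correct scores on a fixed form. The pool size, test length, parameter distributions, and the absence of content balancing and exposure control are simplifying assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3PL item pool (sizes and distributions are illustrative, not from the paper).
POOL_SIZE, TEST_LEN = 300, 30
a = rng.lognormal(mean=0.0, sigma=0.3, size=POOL_SIZE)   # discrimination
b = rng.normal(0.0, 1.0, size=POOL_SIZE)                 # difficulty
c = rng.uniform(0.1, 0.25, size=POOL_SIZE)               # lower asymptote (guessing)

def p3pl(theta, aj, bj, cj):
    # 3PL probability of a correct response.
    return cj + (1.0 - cj) / (1.0 + np.exp(-1.7 * aj * (theta - bj)))

def item_information(theta, aj, bj, cj):
    # Fisher information of a 3PL item at theta.
    p = p3pl(theta, aj, bj, cj)
    return (1.7 * aj) ** 2 * ((1.0 - p) / p) * ((p - cj) / (1.0 - cj)) ** 2

# Quadrature grid and standard normal prior for EAP ability estimation.
GRID = np.linspace(-4.0, 4.0, 81)
PRIOR = np.exp(-0.5 * GRID ** 2)
PRIOR /= PRIOR.sum()

def eap(responses, items):
    # Expected a posteriori estimate of theta given scored responses.
    like = np.ones_like(GRID)
    for u, j in zip(responses, items):
        p = p3pl(GRID, a[j], b[j], c[j])
        like *= p if u else 1.0 - p
    post = like * PRIOR
    post /= post.sum()
    return float(np.dot(GRID, post))

def simulate_cat(theta_true):
    # Fixed-length CAT with maximum-information item selection and no exposure control.
    administered, responses, theta_hat = [], [], 0.0
    for _ in range(TEST_LEN):
        info = item_information(theta_hat, a, b, c)
        info[administered] = -np.inf           # do not reuse administered items
        j = int(np.argmax(info))
        u = int(rng.random() < p3pl(theta_true, a[j], b[j], c[j]))
        administered.append(j)
        responses.append(u)
        theta_hat = eap(responses, administered)
    return theta_hat

# Contrast CAT/EAP scores with number-correct scores on a fixed "P&P-like" form.
fixed_form = rng.choice(POOL_SIZE, size=TEST_LEN, replace=False)
for theta_true in (-1.0, 0.0, 1.0):
    cat_score = simulate_cat(theta_true)
    nc_score = sum(int(rng.random() < p3pl(theta_true, a[j], b[j], c[j])) for j in fixed_form)
    print(f"true theta {theta_true:+.1f}: CAT EAP = {cat_score:+.2f}, "
          f"fixed-form number-correct = {nc_score}/{TEST_LEN}")

In a study like the one described in the abstract, constraints such as content balancing and item exposure rate control would be layered onto the item selection step, and the simulation repeated across alternate item pools and test lengths to compare the resulting score distributions.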
