Reporting Valid and Reliable Overall Scores and Domain Scores

In educational assessment, overall scores are sometimes reported that are obtained by simply averaging a set of domain scores. Simple averaging, however, ignores that the domains carry different numbers of score points, that the domain scores are correlated, and that the relationship between the overall score and a domain score may differ at different points of the score scale. To report reliable and valid overall scores and domain scores, I investigated the performance of four methods using both real and simulated data: (a) the unidimensional IRT model; (b) the higher-order IRT model, which estimates the overall ability and the domain abilities simultaneously; (c) the multidimensional IRT (MIRT) model, which estimates the domain abilities and uses the maximum information method to obtain the overall ability; and (d) the bifactor general model. My findings suggest that the MIRT model provides not only reliable domain scores but also reliable overall scores; the overall score from the MIRT maximum information method has the smallest standard error of measurement. In addition, unlike the other models, it assumes no linear relationship between the overall score and the domain scores. Recommendations are provided for the sizes of the correlations between domains and the number of items needed for reporting purposes.
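As context for method (c), the sketch below gives the compensatory two-parameter MIRT response function and the direction-of-maximum-information idea behind the overall score. The notation (item discrimination vector a_i, intercept d_i, unit direction α) follows the standard Reckase-style MIRT formulation and is an illustrative assumption, not a derivation taken from this paper.

```latex
% Compensatory two-parameter MIRT response function for item i
% (a_i: discrimination vector, d_i: intercept, theta: ability vector):
P_i(\boldsymbol{\theta})
  = \frac{\exp\!\bigl(\mathbf{a}_i^{\top}\boldsymbol{\theta} + d_i\bigr)}
         {1 + \exp\!\bigl(\mathbf{a}_i^{\top}\boldsymbol{\theta} + d_i\bigr)}

% Item information in a unit direction alpha (Reckase-style MIRT):
I_i(\boldsymbol{\theta}, \boldsymbol{\alpha})
  = \bigl(\mathbf{a}_i^{\top}\boldsymbol{\alpha}\bigr)^{2}\,
    P_i(\boldsymbol{\theta})\,\bigl(1 - P_i(\boldsymbol{\theta})\bigr),
  \qquad \lVert\boldsymbol{\alpha}\rVert = 1

% Maximum information method (illustrative summary): report the composite
% along the direction that maximizes the total test information,
\boldsymbol{\alpha}^{*}
  = \arg\max_{\lVert\boldsymbol{\alpha}\rVert = 1}
    \sum_{i} I_i(\boldsymbol{\theta}, \boldsymbol{\alpha}),
  \qquad
  \theta_{\mathrm{overall}} = (\boldsymbol{\alpha}^{*})^{\top}\boldsymbol{\theta}
```

Because the maximizing direction depends on θ through the response probabilities, it can differ across the ability space, which is consistent with the point above that this method assumes no single linear relationship between the overall score and the domain scores.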
