Methods for Evaluating Composite Reliability, Classification Consistency, and Classification Accuracy for Mixed-Format Licensure Tests

This study proposes extensions of reliability estimation methods for determining the conditions under which single scoring of constructed-response (CR) items is as effective as double scoring in mixed-format licensure tests. Multivariate generalizability theory methods, traditionally used to estimate composite score reliability, were extended with simulations so that classification consistency and classification accuracy could also be estimated. Composite score reliability, classification consistency, and classification accuracy were estimated under both double and single scoring of the CR items of three licensure tests. The same indices were also estimated in decision studies that varied the testing conditions, such as the number of CR items and the weight given to the CR section.
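The simulation idea summarized above can be illustrated with a minimal sketch. It is not the study's procedure: the function name `simulate_composite`, the normal score model, the true-score correlation, and every variance value below are hypothetical assumptions chosen only for illustration. The sketch generates true scores for a multiple-choice (MC) and a CR section, adds measurement error on two independent replications, averages rater error over the number of ratings, and then computes composite reliability, classification consistency (same pass/fail decision on both replications), and classification accuracy (observed decision matches the true-score decision).

```python
import random
import statistics


def simulate_composite(n_raters, n_examinees=20000, w_cr=0.5, cut=0.0, seed=0):
    """Simulate composite reliability, classification consistency, and accuracy.

    All distributional settings are illustrative assumptions, not estimates
    from any real licensure test.
    """
    rng = random.Random(seed)
    w_mc = 1.0 - w_cr          # weight of the MC section in the composite
    rho = 0.7                  # assumed MC/CR true-score correlation
    mc_err_sd = 0.4            # assumed MC error standard deviation
    cr_task_err_sd = 0.5       # assumed CR error not related to raters
    cr_rater_err_sd = 0.8      # assumed rater-related error for one rating

    def observe(t_mc, t_cr):
        # One replication of the observed composite score.
        e_mc = rng.gauss(0.0, mc_err_sd)
        e_task = rng.gauss(0.0, cr_task_err_sd)
        # Averaging over raters shrinks the rater-related error variance.
        e_rater = sum(rng.gauss(0.0, cr_rater_err_sd)
                      for _ in range(n_raters)) / n_raters
        return w_mc * (t_mc + e_mc) + w_cr * (t_cr + e_task + e_rater)

    true_scores, observed_scores = [], []
    consistent = accurate = 0
    for _ in range(n_examinees):
        # Correlated standard-normal true scores for the two sections.
        t_mc = rng.gauss(0.0, 1.0)
        t_cr = rho * t_mc + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        true_comp = w_mc * t_mc + w_cr * t_cr

        x1 = observe(t_mc, t_cr)
        x2 = observe(t_mc, t_cr)

        consistent += int((x1 >= cut) == (x2 >= cut))
        accurate += int((x1 >= cut) == (true_comp >= cut))
        true_scores.append(true_comp)
        observed_scores.append(x1)

    reliability = (statistics.pvariance(true_scores)
                   / statistics.pvariance(observed_scores))
    return {
        "reliability": reliability,
        "consistency": consistent / n_examinees,
        "accuracy": accurate / n_examinees,
    }


if __name__ == "__main__":
    for raters in (1, 2):
        print(raters, simulate_composite(n_raters=raters))
```

Changing `w_cr` or the error settings plays the role of a decision study in this sketch: the same three indices can be compared across hypothetical section weights or scoring designs to see when the single-scoring results come close to the double-scoring results.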
