A Systematic Exploration of Judge Scoring Designs and Judge Analysis Methods in Performance Assessment

Quality control of judges’ rating behaviors bears directly on the validity of examinees’ scores in a performance assessment. The purpose of the present study was twofold: (a) to compare two estimation techniques, one from a Rasch measurement perspective and one from a nonlinear mixed modeling perspective; and (b) to compare judge severity estimates under two different scoring designs. The SAS PROC NLMIXED and FACETS software packages were used to evaluate the accuracy of the two estimation techniques, and the judge scoring design of a live English proficiency test was one of the designs under investigation. Results indicated that the two analytical methods performed comparably in recovering the true values of judge severity. In contrast, of the two judge scoring strategies, the spiral design recovered the model effects with an acceptable degree of accuracy, whereas the true values of the model effects, including judge severity, were substantially compromised under the nested design. The present study illustrates effective ways to strategize a judge scoring design and to estimate judge severity in performance testing.
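As a sketch of the measurement framework referenced above, the judge (rater) facet is typically modeled with the many-facet Rasch model in Linacre's standard formulation; the notation below is the conventional one and is not drawn from this study's specific parameterization:

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k,$$

where $P_{nijk}$ is the probability that examinee $n$ receives category $k$ rather than $k-1$ on item $i$ from judge $j$, $\theta_n$ is examinee ability, $\delta_i$ is item difficulty, $\alpha_j$ is judge severity, and $\tau_k$ is the threshold for rating category $k$. Comparing estimates of $\alpha_j$ across estimation methods (FACETS vs. a nonlinear mixed model fit in PROC NLMIXED) and across scoring designs (spiral vs. nested) is the core of the comparison described here.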
