Rater types in writing performance assessments: A classification approach to rater variability

Research on rater effects in language performance assessments has provided ample evidence of considerable variability among raters. Building on this research, I advance the hypothesis that experienced raters fall into types or classes that are clearly distinguishable from one another with respect to the importance they attach to scoring criteria. To examine this rater type hypothesis, I asked 64 raters actively involved in scoring examinee writing performance on a large-scale assessment instrument to indicate on a four-point scale how much importance they would attach to each of nine routinely used criteria. The criteria covered various performance aspects, such as fluency, completeness, and grammatical correctness. In a preliminary step, many-facet Rasch analysis revealed that raters differed significantly in their views on the importance of the various criteria. A two-mode clustering technique then yielded a joint classification of raters and criteria, with six rater types emerging from the analysis. Each type was characterized by a distinct scoring profile, indicating that raters were far from dividing their attention evenly among the set of criteria. Moreover, rater background variables were shown to partially account for the differences between scoring profiles. The findings have implications for quality assurance in large-scale rater-mediated language testing, as well as for rater monitoring and rater training.
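The joint classification described above can be illustrated with a simple member of the two-mode clustering family, sometimes called double k-means: alternately reassign raters (rows) and criteria (columns) of an importance matrix so that each row-cluster by column-cluster block is as homogeneous as possible. This is a minimal sketch only, not the specific algorithm used in the study; the toy data, cluster counts, and starting partitions are hypothetical.

```python
def two_mode_kmeans(X, row_init, col_init, n_iter=10):
    """Jointly partition rows (raters) and columns (criteria) of X so
    that each row-cluster x column-cluster block is as homogeneous
    as possible (a simple 'double k-means' two-mode clustering)."""
    n_rows, n_cols = len(X), len(X[0])
    n_rc, n_cc = max(row_init) + 1, max(col_init) + 1
    rows, cols = list(row_init), list(col_init)
    for _ in range(n_iter):
        # Mean of each block (row cluster a, column cluster b).
        sums = [[0.0] * n_cc for _ in range(n_rc)]
        cnts = [[0] * n_cc for _ in range(n_rc)]
        for i in range(n_rows):
            for j in range(n_cols):
                sums[rows[i]][cols[j]] += X[i][j]
                cnts[rows[i]][cols[j]] += 1
        means = [[sums[a][b] / cnts[a][b] if cnts[a][b] else 0.0
                  for b in range(n_cc)] for a in range(n_rc)]
        # Reassign each rater to the row cluster whose block means
        # best reproduce that rater's importance ratings.
        for i in range(n_rows):
            costs = [sum((X[i][j] - means[a][cols[j]]) ** 2
                         for j in range(n_cols)) for a in range(n_rc)]
            rows[i] = costs.index(min(costs))
        # Likewise reassign each criterion to its best column cluster.
        for j in range(n_cols):
            costs = [sum((X[i][j] - means[rows[i]][b]) ** 2
                         for i in range(n_rows)) for b in range(n_cc)]
            cols[j] = costs.index(min(costs))
    return rows, cols

# Hypothetical importance matrix (6 raters x 4 criteria, 1-4 scale):
# raters 0-2 stress the first two criteria, raters 3-5 the last two.
X = [[4, 4, 1, 1]] * 3 + [[1, 1, 4, 4]] * 3
rows, cols = two_mode_kmeans(X, row_init=[0, 1, 0, 1, 0, 1],
                             col_init=[0, 0, 1, 1])
```

With this toy input the procedure recovers two rater types (raters 0-2 vs. 3-5) together with the criterion grouping that distinguishes them, which is the sense in which the classification of raters and criteria is joint rather than two separate one-mode analyses.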
