The Critical Role of Anchor Paper Selection in Writing Assessment

Scoring rubrics are routinely used to evaluate the quality of writing samples produced for writing performance assessments, with anchor papers chosen to exemplify the score points defined in the rubric. Although careful selection of anchor papers is considered a best practice for scoring, little research has examined the role of anchor paper selection in writing assessment. This study examined the consequences of selecting different anchor papers to represent a common scoring rubric. A set of writing samples was scored under two conditions: one using anchor papers selected from within a single grade level and one using anchor papers selected from across three grade levels. The observed ratings were analyzed with three- and four-facet Rasch (one-parameter logistic) models. Ratings from the two conditions differed in both magnitude and rank order, and this difference is presumed to reflect the anchor paper conditions rather than a difference in overall severity between the rater groups. The results shed light on potential threats to validity in conventional context-dependent scoring practices and raise issues that have not previously been investigated with respect to the selection of anchor papers, including the interpretation of results at different grade levels, implications for assessing progress over time, and the reliability of anchor paper selection within a scoring context.
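For readers unfamiliar with the facets approach, the general form of a many-facet Rasch rating scale model can be sketched as follows. This is an illustrative formulation rather than the authors' exact specification; the particular facets shown (rater, writing domain) are assumptions for the sake of the example, and a fourth facet (such as the anchor paper condition) would simply add another term on the right-hand side:

log( P_nijk / P_nij(k-1) ) = B_n - C_i - D_j - F_k

where P_nijk is the probability that examinee n receives a rating in category k from rater i on domain j, B_n is the examinee's writing ability, C_i is the severity of rater i, D_j is the difficulty of domain j, and F_k is the threshold governing the step from category k-1 to category k.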
