When the raters of constructed-response items, such as writing samples, disagree on the level of proficiency exhibited in a response, testing agencies must resolve the score discrepancy before computing an operational score for release to the public. Several forms of score resolution are used throughout the assessment industry. In this study, we selected four of the more common forms of score resolution reported in a national survey of testing agencies and investigated the effect of each on the interrater reliability of the resulting operational scores. We show that some forms of resolution are associated with higher reliability than others and that some may artificially inflate interrater reliability. Moreover, the choice of resolution method can affect the percentage of papers defined as passing in a high-stakes assessment.
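The final claim, that the resolution rule itself can shift pass rates, is easy to see in a small sketch. The two rules below (rater-mean and third-rater adjudication) and the sample ratings are illustrative assumptions for demonstration only; the abstract does not name the four methods the study actually examined.

```python
def resolve_mean(r1, r2):
    # Rater-mean resolution: average the two original discrepant ratings.
    return (r1 + r2) / 2

def resolve_third_rater(r1, r2, r3):
    # Third-rater adjudication (one common variant, assumed here):
    # average the adjudicator's rating with the closer original rating.
    closer = r1 if abs(r1 - r3) <= abs(r2 - r3) else r2
    return (closer + r3) / 2

def pass_rate(scores, cut):
    # Percentage of papers at or above the cut score.
    return 100 * sum(s >= cut for s in scores) / len(scores)

# Hypothetical discrepant ratings on a 1-6 scale: (rater1, rater2, rater3).
ratings = [(3, 5, 4), (2, 4, 2), (4, 6, 6), (3, 4, 3)]
cut = 4.0

mean_scores = [resolve_mean(r1, r2) for r1, r2, _ in ratings]
adj_scores = [resolve_third_rater(r1, r2, r3) for r1, r2, r3 in ratings]

print(f"rater-mean pass rate:  {pass_rate(mean_scores, cut):.0f}%")   # 50%
print(f"third-rater pass rate: {pass_rate(adj_scores, cut):.0f}%")   # 25%
```

With identical ratings and the same cut score, the two rules classify different papers as passing, which is the pass-rate sensitivity the abstract describes.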