Reliability and Generalizability of Ratings of Compositions.
ABSTRACT A total of 1,487 eleventh-grade students from the Hamburg (West Germany) school system were asked to complete four writing assignments used in an International Association for the Evaluation of Educational Achievement (IEA) study of writing assessment. In analyzing the writing samples, the study focused on: (1) between-rater effects; (2) within-rater effects; (3) between-assignment effects; and (4) within-student effects. Two independent sets of scores on a 5-point scale were awarded to each essay in accordance with the international scoring guides. Because studying these four factors involves a complex array of statistical analyses, the researchers did not rely on parametric test models alone; they also included more intuitive statistics, such as the percentage of perfect agreement between two independent ratings and the percentage of loose agreement, defined as the percentage of rating pairs differing by not more than one scale point. Results showed that there was no single or simple answer to the question of how reliably general writing achievement could be measured across the tasks used; rather, the answer depended on the assumptions about the tasks that one is prepared to make. Statistically, it was a function of whether within-student variation was treated as true variance or as error. It was also concluded that 13 writing tasks would have been needed to obtain satisfactory generalizability. (KSA)
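The two agreement statistics and the task-count conclusion lend themselves to a compact illustration. Below is a minimal Python sketch, assuming two vectors of independent ratings on the study's 5-point scale; the function names, the example data, the 0.80 target, and the per-task coefficient of 0.24 are illustrative assumptions, not values from the report. The projection uses the standard Spearman-Brown prophecy formula, which reproduces the logic behind a conclusion like "about 13 tasks are needed" rather than the study's exact generalizability model.

```python
import numpy as np

def agreement_rates(ratings_a, ratings_b):
    """Percentage of perfect agreement and of 'loose' agreement
    (ratings differing by at most one scale point) between two
    independent raters on the same essays."""
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    diff = np.abs(a - b)
    perfect = np.mean(diff == 0) * 100
    loose = np.mean(diff <= 1) * 100
    return perfect, loose

def tasks_needed(g_single, g_target=0.80):
    """Spearman-Brown-style projection: number of parallel tasks
    required to raise a per-task generalizability coefficient
    g_single to g_target. The 0.80 target is an assumed convention,
    not the criterion stated in the report."""
    return (g_target * (1 - g_single)) / (g_single * (1 - g_target))

# Hypothetical example: two raters scoring five essays on the 5-point scale.
perfect, loose = agreement_rates([3, 4, 2, 5, 1], [3, 3, 2, 4, 3])
print(f"perfect: {perfect:.0f}%, loose: {loose:.0f}%")  # perfect: 40%, loose: 80%

# An assumed per-task coefficient of ~0.24 would call for roughly 13 tasks:
print(f"tasks needed: {tasks_needed(0.24):.1f}")  # tasks needed: 12.7
```

The design choice reflected in the abstract is visible in `tasks_needed`: whether within-student variation counts as true variance or as error changes `g_single`, and thus how many tasks the projection demands.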