What makes marking reliable? Experiments with UK examinations

Marking reliability is often attributed to an effective community of practice, yet no experimental research has been identified that empirically verifies which aspects of such a community actually produce reliable marking. This research outlines what that community of practice might entail and presents two experimental studies on the effects of particular aspects of community of practice on examiners' marking reliability. The first study investigated the impact of exemplar work: all examiners were provided with mark schemes, and some were additionally given exemplar scripts and feedback on their marking of those scripts. The second study explored the effect of discussing the mark scheme: all examiners received mark schemes and exemplar scripts, but some did not attend a coordination meeting. Neither procedure (use of exemplar scripts or discussion between examiners) improved marking reliability, calling into question the predictive utility of the theory of community of practice.
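
The studies compare marking reliability between examiner groups given different forms of support. One common way to quantify this kind of comparison (not necessarily the measure used in the paper) is to score each examiner's agreement with a reference marker on the same scripts and then compare the distribution of agreement across conditions. The sketch below is a minimal illustration of that idea in Python; the data, group labels, and the mean-absolute-deviation measure are assumptions made for illustration only.

```python
# Illustrative sketch only: treat marking reliability as agreement between each
# examiner's marks and a reference (principal) examiner's marks on the same scripts,
# then compare two hypothetical experimental conditions. The data and the choice of
# measure are invented; the paper's actual analysis may differ.
from statistics import mean
import random

random.seed(0)

def mean_abs_deviation(examiner_marks, reference_marks):
    """Average absolute difference between an examiner's marks and the reference marks."""
    return mean(abs(e - r) for e, r in zip(examiner_marks, reference_marks))

# Hypothetical data: 20 scripts marked out of 40 by a reference examiner.
reference = [random.randint(10, 40) for _ in range(20)]

def simulate_examiner(noise):
    """A simulated examiner whose marks scatter around the reference by +/- noise."""
    return [max(0, min(40, r + random.randint(-noise, noise))) for r in reference]

# Two hypothetical groups, e.g. examiners given exemplar scripts plus feedback
# versus examiners given the mark scheme only.
with_exemplars = [simulate_examiner(noise=3) for _ in range(10)]
mark_scheme_only = [simulate_examiner(noise=3) for _ in range(10)]

for label, group in [("exemplars + feedback", with_exemplars),
                     ("mark scheme only", mark_scheme_only)]:
    deviations = [mean_abs_deviation(marks, reference) for marks in group]
    print(f"{label}: mean absolute deviation = {mean(deviations):.2f}")
```

A smaller mean absolute deviation would indicate closer agreement with the reference marker; comparing the two groups' values (for example with a significance test) is one way to ask whether a support procedure improved reliability.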
