An Inter-Rater Reliability Analysis of Good's Program Summary Analysis Scheme

In computer science education and in research into the psychology of programming, program summary analysis has been used to characterize the mental models of novice and expert programmers and to measure how well programs and programming concepts have been learned. This paper reports an investigation in which three raters applied Good's program summary analysis scheme, which consists of two independent classifications of program summary segments: information types and object description categories. Problems in applying the scheme, as well as differences between the raters, were recorded and analyzed. The findings indicate that most of the observed inter-rater differences can be avoided by improving the scheme and its documentation. The one remaining open problem concerns distinguishing between descriptions of data and descriptions of activities in cases where the specific words used, or the abstractness of the expression, may affect raters' interpretation of the information type.
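
As a rough illustration of how agreement among three raters on a categorical classification scheme such as this one can be quantified, the following Python sketch computes Fleiss' kappa over a set of segment classifications. The choice of statistic, the category labels, and the rating counts below are hypothetical additions for illustration; the abstract does not specify which agreement measure was used in the study.

    # Minimal sketch: Fleiss' kappa for multiple raters assigning
    # categorical labels (e.g. information types) to summary segments.
    # All labels and counts are hypothetical, for illustration only.

    def fleiss_kappa(counts):
        """counts[i][j] = number of raters who put segment i in category j."""
        n_segments = len(counts)
        n_raters = sum(counts[0])           # raters per segment (constant)
        n_categories = len(counts[0])
        total = n_segments * n_raters

        # Proportion of all assignments that fall into each category.
        p_j = [sum(row[j] for row in counts) / total
               for j in range(n_categories)]

        # Observed agreement for each segment.
        P_i = [(sum(c * c for c in row) - n_raters)
               / (n_raters * (n_raters - 1))
               for row in counts]

        P_bar = sum(P_i) / n_segments       # mean observed agreement
        P_e = sum(p * p for p in p_j)       # agreement expected by chance
        return (P_bar - P_e) / (1 - P_e)

    # Three raters classifying five segments into three hypothetical
    # categories (say, data / activity / other); each row sums to 3.
    ratings = [
        [3, 0, 0],
        [2, 1, 0],
        [0, 3, 0],
        [1, 1, 1],
        [0, 0, 3],
    ]
    print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")  # approx. 0.493

Kappa values correct the raw percentage of agreement for agreement expected by chance, which is why such statistics are a common choice when reporting the reliability of qualitative coding schemes with a small number of raters.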