Benchmarking Kappa: Interrater Agreement in Software Process Assessments

Software process assessments are by now a prevalent tool for process improvement and contract risk assessment in the software industry. Because scores are assigned to processes during an assessment, a process assessment can be considered a subjective measurement procedure. As with any subjective measurement procedure, the reliability of process assessments has important implications for the utility of assessment scores, and reliability can therefore serve as a criterion for evaluating an assessment's quality. The particular type of reliability of interest in this paper is interrater agreement. Thus far, empirical evaluations of the interrater agreement of assessments have used Cohen's Kappa coefficient. Once a Kappa value has been derived, the next question is "how good is it?" Benchmarks for interpreting obtained Kappa values are available from the social sciences and medical literature; however, their applicability to the software process assessment context is not obvious. In this paper we develop a benchmark for interpreting Kappa values using data from ratings of 70 process instances collected from assessments of 19 different projects in 7 different organizations in Europe during the SPICE Trials (an international effort to empirically evaluate the emerging ISO/IEC 15504 International Standard for Software Process Assessment). The benchmark indicates that Kappa values below 0.45 represent poor agreement, while values above 0.62 represent substantial agreement and should be the minimum aimed for. This benchmark can be used to decide how good an assessment's reliability is.
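To make the interpretation concrete, the following is a minimal sketch in Python of how Cohen's Kappa for two assessors could be computed and then read against the thresholds reported above (below 0.45 poor, above 0.62 substantial). The Kappa formula itself, kappa = (p_o - p_e) / (1 - p_e), is standard; the "moderate" label for the band between the two thresholds and the example ratings are illustrative assumptions, not data or terminology from the paper.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters assigning nominal categories.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion
    of agreement and p_e is the agreement expected by chance, computed from
    each rater's marginal category frequencies.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: sum over categories of the product of marginal proportions.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in freq_a.keys() | freq_b.keys())

    if p_e == 1.0:
        return 1.0  # degenerate case: both raters use a single identical category
    return (p_o - p_e) / (1 - p_e)


def interpret(kappa):
    """Read a Kappa value against the benchmark reported in the paper:
    below 0.45 is poor; above 0.62 is substantial agreement."""
    if kappa < 0.45:
        return "poor"
    if kappa > 0.62:
        return "substantial"
    return "moderate"  # assumed label for the intermediate band


# Illustrative example: two assessors rating ten process instances on a
# four-point scale (invented data, not from the SPICE Trials).
rater_1 = [0, 1, 1, 2, 2, 3, 1, 2, 0, 1]
rater_2 = [0, 1, 2, 2, 2, 3, 1, 1, 0, 1]
k = cohens_kappa(rater_1, rater_2)
print(f"kappa = {k:.2f} -> {interpret(k)}")
```

Under these assumptions, an assessment team would compute Kappa from independent ratings of the same process instances and treat any value below 0.45 as a signal that the assessment's reliability is inadequate.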
