Evaluating interrater agreement with intraclass correlation coefficient in SPICE-based software process assessment

Because software process assessment (SPA) involves subjective judgment, its reliability is an important issue. Two types of reliability have been investigated intensively in SPA: internal consistency (internal reliability) and interrater agreement (external reliability). This study investigates interrater agreement. Cohen's Kappa coefficient has been a popular measure for estimating interrater agreement; however, in certain situations its application is inappropriate because of the "Kappa paradoxes". To cope with this shortcoming of the Kappa coefficient, this study applies the intraclass correlation coefficient (ICC) to estimate interrater agreement; the ICC has not previously been employed in the SPA context. In addition, we examine the stability of the estimated ICC value using a bootstrap resampling method. The results show that the ICC can be applied in situations where the Kappa coefficient cannot, although not in all cases.
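
To make the two techniques named above concrete, the following is a minimal Python sketch (not taken from the paper) that computes ICC(2,1), the two-way random-effects, single-rater, absolute-agreement form described by McGraw and Wong [6], for an n-targets-by-k-assessors rating matrix, and gauges its stability with a percentile bootstrap over the rated targets, in the spirit of the resampling approach in [13]. The example ratings are purely illustrative, not SPICE trial data.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, single rater, absolute agreement
    (McGraw & Wong [6]). `ratings` is an n x k matrix: n rated targets
    (e.g. process instances) by k assessors."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    # Two-way ANOVA without replication.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((ratings - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-targets mean square
    msc = ss_cols / (k - 1)              # between-assessors mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def bootstrap_icc(ratings, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ICC(2,1), resampling rated targets (rows)."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.shape[0]
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        estimates.append(icc_2_1(ratings[idx]))
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return icc_2_1(ratings), (lo, hi)

# Illustrative capability ratings (0-5 scale): 6 process instances, 3 assessors.
ratings = np.array([
    [3, 3, 2],
    [1, 1, 1],
    [4, 3, 4],
    [2, 2, 3],
    [5, 4, 4],
    [0, 1, 0],
])
icc, ci = bootstrap_icc(ratings)
print(f"ICC(2,1) = {icc:.3f}, 95% bootstrap interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Resampling whole rows keeps each target's full set of ratings together, a common choice when bootstrapping interrater statistics; the specific data layout and ICC form above are assumptions for illustration.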

[1] Khaled El Emam, et al. Benchmarking Kappa: Interrater Agreement in Software Process Assessments, 1999, Empirical Software Engineering.

[2] Bob Smith, et al. Modelling the reliability of SPICE based assessments, 1997, Proceedings of IEEE International Symposium on Software Engineering Standards.

[3] E. A. Haggard. Intraclass correlation and the analysis of variance, 1960.

[4] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[5] Lionel C. Briand, et al. Assessor agreement in rating SPICE processes, 1996, Softw. Process. Improv. Pract.

[6] K. McGraw, et al. Forming inferences about some intraclass correlation coefficients, 1996.

[7] Dennis R. Goldenson, et al. Interrater agreement in SPICE-based assessments: some preliminary results, 1996, Proceedings of Software Process 1996.

[8] A. Feinstein, et al. High agreement but low kappa: I. The problems of two paradoxes, 1990, Journal of Clinical Epidemiology.

[9] Khaled El Emam, et al. Findings from Phase 2 of the SPICE trials, 2001, Softw. Process. Improv. Pract.

[10] Bob Smith, et al. The Internal Consistencies of the 1987 SEI Maturity Questionnaire and the SPICE Capability Dimension, 1998, Empirical Software Engineering.

[11] R. L. Ebel. Estimation of the reliability of ratings, 1951.

[12] Dennis R. Goldenson, et al. The Internal Consistency of Key Process Areas in the Capability Maturity Model (CMM) for Software (SW-CMM), 2002.

[13] C. Lunneborg. Data Analysis by Resampling: Concepts and Applications, 1999.

[14] Khaled El Emam, et al. Cost implications of interrater agreement for software process assessments, 1998, Proceedings Fifth International Software Metrics Symposium. Metrics (Cat. No.98TB100262).

[15] E. A. Haggard, et al. Intraclass Correlation and the Analysis of Variance, 1958.

[16] A. Feinstein, et al. High agreement but low kappa: II. Resolving the paradoxes, 1990, Journal of Clinical Epidemiology.

[17] Ho-Won Jung, et al. Evaluating interrater agreement in SPICE-based assessments, 2003, Comput. Stand. Interfaces.

[18] K. Lee, et al. Analysis of interrater agreement in ISO/IEC 15504-based software process assessment, 2001, Proceedings Second Asia-Pacific Conference on Quality Software.

[19] Bob Smith, et al. Evaluating the interrater agreement of process capability ratings, 1997, Proceedings Fourth International Software Metrics Symposium.