Unfolding the phenomenon of interrater agreement: a multicomponent approach for in-depth examination was proposed.

OBJECTIVE The overall objective was to unfold the phenomenon of interrater agreement: to identify potential sources of variation in agreement data and to explore how they can be statistically accounted for. The ultimate aim was to propose recommendations for in-depth examination of agreement in order to improve the reliability of assessment instruments.

STUDY DESIGN AND SETTING Using a sample in which 10 rater pairs had assessed the presence/absence of 188 environmental barriers with a systematic rating form, a raters × items data set was generated (N = 1,880). In addition to common agreement indices, relative shares of agreement variation were calculated. Multilevel regression analysis was carried out, using rater and item characteristics as predictors of agreement variation.

RESULTS Following a conceptual decomposition, the agreement variation was statistically disentangled into relative shares. Raters accounted for 6-11% of the variation, items for 32-33%, and the residual for 57-60%. Multilevel regression analysis showed that barrier prevalence and raters' familiarity with using standardized instruments had the strongest impact on agreement.

CONCLUSION Supported by a conceptual analysis, we propose an approach for in-depth examination of agreement variation as a strategy for increasing the level of interrater agreement. By identifying and limiting the most important sources of disagreement, instrument reliability can ultimately be improved.
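To make the two analytic steps concrete, the sketch below illustrates, on simulated data, how a raters × items agreement data set of the kind described could be decomposed into relative shares of variation and then analysed with a multilevel regression. This is not the authors' analysis code: the column names (pair, item, agree, prevalence, familiarity), the simulated values, and the use of Python/statsmodels are assumptions made for illustration, and the linear mixed model on a binary outcome is only an approximation of a logistic multilevel model.

```python
# Minimal sketch (assumed setup, not the study's code): decompose 0/1 agreement
# scores into shares attributable to rater pairs, items, and the residual, then
# regress agreement on rater/item characteristics with a multilevel model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
pairs, items = 10, 188  # 10 rater pairs x 188 items = 1,880 observations

# Simulated stand-in data: agree = 1 if both raters in a pair gave the same rating.
df = pd.DataFrame([(p, i) for p in range(pairs) for i in range(items)],
                  columns=["pair", "item"])
prev_by_item = rng.uniform(0.05, 0.95, items)          # hypothetical item-level predictor
fam_by_pair = rng.integers(0, 2, pairs)                 # hypothetical pair-level predictor
df["prevalence"] = prev_by_item[df["item"].to_numpy()]
df["familiarity"] = fam_by_pair[df["pair"].to_numpy()]
logit = -0.5 + 1.5 * df["familiarity"] - 2.0 * (df["prevalence"] - 0.5).abs()
df["agree"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# 1) Two-way decomposition of agreement variation into relative shares
#    (rater pairs, items, residual), analogous to the shares reported above.
grand = df["agree"].mean()
ss_total = ((df["agree"] - grand) ** 2).sum()
ss_pair = items * ((df.groupby("pair")["agree"].mean() - grand) ** 2).sum()
ss_item = pairs * ((df.groupby("item")["agree"].mean() - grand) ** 2).sum()
ss_resid = ss_total - ss_pair - ss_item
print({name: round(ss / ss_total, 3)
       for name, ss in [("pairs", ss_pair), ("items", ss_item), ("residual", ss_resid)]})

# 2) Multilevel regression of agreement on an item characteristic (prevalence)
#    and a rater characteristic (familiarity), with a random intercept per pair.
model = smf.mixedlm("agree ~ prevalence + familiarity", df, groups=df["pair"])
print(model.fit().summary())
```

In an actual application the simulated block would be replaced by the observed agreement indicators, and a logistic multilevel model would be the more principled choice for the binary outcome.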
