Reliability in evaluator-based tests: using simulation-constructed models to determine contextually relevant agreement thresholds

Background: Indices of inter-evaluator reliability are used in many fields, such as computational linguistics, psychology, and medical science; however, the interpretation of the resulting values and the determination of appropriate thresholds lack context and are often guided only by arbitrary “rules of thumb” or simply not addressed at all. Our goal in this work was to develop a method for determining the relationship between inter-evaluator agreement and error to facilitate meaningful interpretation of values, thresholds, and reliability.

Methods: Three expert human evaluators completed a video analysis task, and their results were averaged to create a reference dataset of 300 time measurements. We simulated unique combinations of systematic and random error on the reference dataset to generate 4900 new hypothetical evaluators (each with 300 time measurements). The systematic and random errors made by the hypothetical evaluator population were approximated as the mean and variance of a normally distributed error signal. Calculating the error (using percent error) and the inter-evaluator agreement (using Krippendorff’s alpha) between each hypothetical evaluator and the reference dataset allowed us to establish a mathematical model and a value envelope of the worst possible percent error for any given amount of agreement.

Results: We used the relationship between inter-evaluator agreement and error to make an informed judgment about an acceptable threshold for Krippendorff’s alpha within the context of our specific test. To demonstrate the utility of our modeling approach, we calculated the percent error and Krippendorff’s alpha between the reference dataset and a new cohort of trained human evaluators, and used our contextually derived Krippendorff’s alpha threshold as a gauge of evaluator quality. Although all evaluators had relatively high agreement (> 0.9) compared with the rule-of-thumb cutoff (0.8), our agreement threshold accepted the evaluators with low error while rejecting one evaluator with relatively high error.

Conclusions: We found that our approach established reliability thresholds, within the context of our evaluation criteria, that were far less permissive than the typically accepted “rule of thumb” cutoff for Krippendorff’s alpha. This procedure provides a less arbitrary method for determining a reliability threshold and can be tailored to work within the context of any reliability index.
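The Methods paragraph outlines the core simulation: take a reference dataset, overlay normally distributed error across a grid of means (systematic error) and standard deviations (random error), and compute percent error and Krippendorff’s alpha for each hypothetical evaluator against the reference. The sketch below illustrates that procedure. The reference values, error ranges, the 70 × 70 grid, the use of mean absolute percent error, and the two-coder interval-level alpha implementation are all illustrative assumptions, not the paper’s actual data or parameters.

```python
# Minimal sketch of the simulation-based thresholding idea described above.
# Assumptions (not from the paper): synthetic reference data, a 70 x 70 error
# grid (4900 hypothetical evaluators), mean absolute percent error, and a
# two-coder, interval-level Krippendorff's alpha with no missing values.
import numpy as np

rng = np.random.default_rng(0)

def krippendorff_alpha_interval(coder_a, coder_b):
    """Krippendorff's alpha for two coders, interval metric, no missing data."""
    a = np.asarray(coder_a, dtype=float)
    b = np.asarray(coder_b, dtype=float)
    pooled = np.concatenate([a, b])        # all pairable values, n = 2 * units
    n = pooled.size
    # Observed disagreement: mean squared difference within each unit.
    d_o = np.mean((a - b) ** 2)
    # Expected disagreement: mean squared difference over all pairs of pooled values.
    d_e = 2.0 * n / (n - 1) * pooled.var()
    return 1.0 - d_o / d_e

# Hypothetical reference dataset standing in for the 300 expert-averaged times (s).
reference = rng.uniform(0.5, 5.0, size=300)

systematic = np.linspace(-0.5, 0.5, 70)    # mean of the error signal (s)
random_sd = np.linspace(0.0, 0.5, 70)      # standard deviation of the error signal (s)

results = []
for mu in systematic:
    for sigma in random_sd:
        # One hypothetical evaluator: reference plus normally distributed error.
        evaluator = reference + rng.normal(loc=mu, scale=sigma, size=reference.size)
        pct_error = np.mean(np.abs(evaluator - reference) / reference) * 100.0
        alpha = krippendorff_alpha_interval(reference, evaluator)
        results.append((mu, sigma, pct_error, alpha))
results = np.array(results)

# Worst-case envelope: bin by alpha and keep the maximum percent error per bin.
alphas, errors = results[:, 3], results[:, 2]
bins = np.linspace(alphas.min(), 1.0, 50)
envelope = [errors[(alphas >= lo) & (alphas < hi)].max()
            for lo, hi in zip(bins[:-1], bins[1:])
            if np.any((alphas >= lo) & (alphas < hi))]
```

Reading the envelope at a target error tolerance then suggests a contextually grounded alpha threshold, which is the kind of informed judgment the Results paragraph describes.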
