Calibration with confidence: a principled method for panel assessment

Frequently, a set of objects has to be evaluated by a panel of assessors, but not every object is assessed by every assessor. A problem facing such panels is how to take into account different standards among panel members and varying levels of confidence in their scores. Here, a mathematically based algorithm is developed to calibrate the scores of such assessors, addressing both of these issues. The algorithm is based on the connectivity of the graph of assessors and objects evaluated, incorporating declared confidences as weights on its edges. If the graph is sufficiently well connected, relative standards can be inferred by comparing how assessors rate the objects they assess in common, weighted by the declared confidence of each assessment. By removing these biases, ‘true’ values are inferred for all the objects. Reliability estimates for the resulting values are obtained. The algorithm is tested in two case studies: one by computer simulation and the other based on realistic evaluation data. The process is compared to the simple averaging procedure in widespread use and to Fisher's additive incomplete block analysis. It is anticipated that the algorithm will prove useful in a wide variety of situations, such as evaluation of the quality of research submitted to national assessment exercises; appraisal of grant proposals submitted to funding panels; ranking of job applicants; and judgement of performance on degree courses in which candidates can choose from lists of options.
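
As a rough illustration of the kind of calculation described above (a sketch, not the paper's exact formulation), the code below fits an additive model in which each raw score is treated as the object's unknown ‘true’ value plus an assessor-specific offset, with the declared confidences used as least-squares weights; a zero-sum constraint on the offsets makes the model identifiable. The function and variable names (calibrate, scores, n_objects, n_assessors) and the toy data are illustrative assumptions, not taken from the paper.

# Illustrative sketch: assumed additive-bias model with confidence weights.
import numpy as np

def calibrate(scores, n_objects, n_assessors):
    """scores: iterable of (assessor, object, score, confidence) tuples."""
    n = n_objects + n_assessors
    rows, rhs, weights = [], [], []
    for a, o, s, c in scores:
        row = np.zeros(n)
        row[o] = 1.0                  # coefficient of the object's 'true' value
        row[n_objects + a] = 1.0      # coefficient of the assessor's offset (bias)
        rows.append(row); rhs.append(s); weights.append(c)
    # Pin the assessor offsets to sum to zero so the solution is identifiable.
    row = np.zeros(n)
    row[n_objects:] = 1.0
    rows.append(row); rhs.append(0.0); weights.append(1e6)
    A, b, w = np.array(rows), np.array(rhs), np.sqrt(np.array(weights))
    sol, *_ = np.linalg.lstsq(A * w[:, None], b * w, rcond=None)
    return sol[:n_objects], sol[n_objects:]  # calibrated values, assessor offsets

# Toy data: two assessors with different standards scoring three objects;
# object 1 is scored by both, which connects the assessor-object graph.
scores = [(0, 0, 7.0, 1.0), (0, 1, 5.0, 0.5),
          (1, 1, 6.5, 1.0), (1, 2, 8.0, 1.0)]
values, offsets = calibrate(scores, n_objects=3, n_assessors=2)
print("calibrated values:", values)
print("assessor offsets:", offsets)

Note that the system is solvable only because object 1 is scored by both assessors; this is the connectivity requirement on the assessor-object graph described in the abstract, and the confidence weights simply downweight the equations arising from less certain scores.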
