Assessing quality of selection procedures: Lower bound of false positive rate as a function of inter-rater reliability

Inter-rater reliability (IRR) is a commonly used tool for assessing the quality of ratings from multiple raters, in part because it can be estimated directly from the observed ratings. However, applicant selection procedures based on such ratings usually end in a binary outcome: each applicant is either selected or not. IRR does not account for this final outcome; it focuses instead on the ratings of the individual subjects or objects. In this work, we outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a quantile approximation that allows us to estimate the probability of correctly selecting the best applicants and to compute the error probabilities of the selection procedure (i.e., the false-positive and false-negative rates) under the assumption that the ratings are valid. If the ratings are not completely valid, the computed error probabilities are a lower bound on the true error probabilities. We show that the binary classification metrics depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the quantile approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures.
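To illustrate how error rates can follow from the IRR coefficient and the proportion selected alone, the sketch below uses a standard bivariate-normal argument. It is not the quantile approximation developed in the paper, whose details are not given here. It assumes that the true quality and the observed rating are jointly normal with correlation equal to the square root of the IRR, that `irr` is the reliability of the score actually used for selection, and that the "truly best" applicants are the same proportion as the selected ones. The function name `selection_error_rates` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def selection_error_rates(irr, prop_selected, n_steps=4000):
    """Error rates of a top-proportion selection under a bivariate-normal model.

    Assumptions (illustrative, not taken from the paper):
      * true quality T and observed rating Y are standardized and jointly
        normal with correlation sqrt(irr);
      * the top `prop_selected` by Y are selected, and the top
        `prop_selected` by T are the applicants who should be selected.
    """
    if not 0.0 < irr < 1.0:
        raise ValueError("irr must lie strictly between 0 and 1")
    if not 0.0 < prop_selected < 1.0:
        raise ValueError("prop_selected must lie strictly between 0 and 1")
    if n_steps % 2:
        raise ValueError("n_steps must be even for Simpson's rule")

    rho = sqrt(irr)                            # corr(T, Y)
    resid_sd = sqrt(1.0 - irr)                 # sd of Y given T
    z = STD_NORMAL.inv_cdf(1.0 - prop_selected)  # common cutoff quantile

    def integrand(t):
        # density of T at t times P(Y > z | T = t)
        p_selected_given_t = 1.0 - STD_NORMAL.cdf((z - rho * t) / resid_sd)
        return STD_NORMAL.pdf(t) * p_selected_given_t

    # Simpson's rule on [z, upper]; the normal tail beyond `upper` is negligible.
    upper = abs(z) + 10.0
    h = (upper - z) / n_steps
    total = integrand(z) + integrand(upper)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * integrand(z + i * h)
    true_pos = total * h / 3.0                 # P(T > z and Y > z)

    false_count = prop_selected - true_pos     # P(selected, not best) = P(best, not selected)
    return {
        "sensitivity": true_pos / prop_selected,
        "false_negative_rate": false_count / prop_selected,
        "false_positive_rate": false_count / (1.0 - prop_selected),
    }


if __name__ == "__main__":
    for irr in (0.3, 0.6, 0.9):
        print(irr, selection_error_rates(irr, prop_selected=0.2))
```

Under these assumptions, a quick sanity check is available in closed form: when half of the applicants are selected and the IRR is 0.25, the joint probability of being both truly best and selected is 1/4 + arcsin(0.5)/(2π) = 1/3, so the false-positive rate is 1/3. If the ratings are not fully valid, values computed this way should be read as lower bounds, in line with the abstract.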
