We need to talk about reliability: making better use of test-retest studies for study design and interpretation

Positron emission tomography (PET), like many other fields of clinical research, is both time-consuming and expensive, and recruitable patients can be scarce. These constraints limit the possibility of large-sample experimental designs and often lead to statistically underpowered studies. This problem is exacerbated by the use of outcome measures whose accuracy is sometimes insufficient to answer the scientific questions posed. Reliability is usually assessed in validation studies using healthy participants; however, these results are often not readily applicable to clinical studies examining different populations. I present a new method and tools for using summary statistics from previously published test-retest studies to approximate the reliability of outcomes in new samples. In this way, the feasibility of a new study can be assessed at the planning stage, before any new data are collected. An R package called relfeas accompanies this article for performing these calculations. In summary, these methods and tools allow researchers to avoid performing costly studies which, by virtue of their design, are unlikely to yield informative conclusions.
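To illustrate the underlying idea, the sketch below (base R, not the relfeas API; the function name extrapolate_icc and the numbers are hypothetical) approximates the reliability expected in a new sample from test-retest summary statistics, under the assumption that the standard error of measurement (SEM) estimated in the published test-retest study also applies to the new sample.

```r
## Minimal sketch, assuming a constant SEM across samples:
## ICC_new = 1 - SEM^2 / SD_new^2, where SEM = SD_trt * sqrt(1 - ICC_trt).

extrapolate_icc <- function(icc_trt, sd_trt, sd_new) {
  # SEM implied by the published test-retest study
  sem <- sd_trt * sqrt(1 - icc_trt)
  # Reliability in the new sample = 1 - error variance / total variance
  1 - sem^2 / sd_new^2
}

# Hypothetical example: a test-retest study in healthy controls reports
# ICC = 0.85 with a between-subject SD of 0.30; the planned patient sample
# is expected to be more homogeneous (SD = 0.20), implying lower reliability.
extrapolate_icc(icc_trt = 0.85, sd_trt = 0.30, sd_new = 0.20)
#> [1] 0.6625
```

The point of the calculation is that reliability is a property of the measure in a particular sample: the same measurement error yields a lower ICC in a more homogeneous sample, which can be anticipated before any new data are collected.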
