Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM.

Reliability, the consistency of a test or measurement, is frequently quantified in the movement sciences literature. A common metric is the intraclass correlation coefficient (ICC). The standard error of measurement (SEM), which can be calculated from the ICC, is also frequently reported in reliability studies. However, there are several versions of the ICC, and confusion exists in the movement sciences regarding which ICC to use. Further, the utility of the SEM is not fully appreciated. In this review, the basics of classic reliability theory are addressed in the context of choosing and interpreting an ICC. The primary distinction between ICC equations is argued to be the inclusion (equations 2,1 and 2,k) or exclusion (equations 3,1 and 3,k) of systematic error in the denominator of the ICC equation. Inferential tests of mean differences, which are performed in the process of deriving the variance components needed to calculate ICC values, are useful for determining whether systematic error is present. If it is, the measurement schedule should be modified (by removing trials where learning and/or fatigue effects are present) to remove the systematic error, after which ICC equations that consider only random error may be safely used. The use of ICC values is discussed in the context of estimating the effects of measurement error on sample size, statistical power, and correlation attenuation. Finally, calculation and application of the SEM are discussed. It is shown how the SEM and its variants can be used to construct confidence intervals for individual scores and to determine the minimal difference an individual must exhibit before one can be confident that a true change in performance has occurred.
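The quantities named in the abstract can be illustrated with a minimal sketch. It assumes the widely used two-way ANOVA formulations of ICC(2,1) and ICC(3,1), takes the SEM as the square root of the error mean square (one common variant, not necessarily the only one the review covers), and takes the minimal difference as SEM × 1.96 × √2. The function names and the demo scores are hypothetical and are not drawn from the paper.

```python
import math


def anova_mean_squares(scores):
    """Two-way ANOVA mean squares for a table of n subjects (rows) by k trials (columns)."""
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_trials = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    # Guard against a tiny negative value caused by floating-point rounding.
    ss_error = max(ss_total - ss_subjects - ss_trials, 0.0)

    ms_subjects = ss_subjects / (n - 1)
    ms_trials = ss_trials / (k - 1)      # systematic (between-trials) variation
    ms_error = ss_error / ((n - 1) * (k - 1))  # random error
    return n, k, ms_subjects, ms_trials, ms_error


def reliability_summary(scores):
    """Return ICC(2,1), ICC(3,1), SEM, minimal difference, and the trials F ratio."""
    n, k, msb, mst, mse = anova_mean_squares(scores)
    # ICC(2,1): systematic error between trials is included in the denominator.
    icc_2_1 = (msb - mse) / (msb + (k - 1) * mse + k * (mst - mse) / n)
    # ICC(3,1): only random error appears in the denominator.
    icc_3_1 = (msb - mse) / (msb + (k - 1) * mse)
    # Assumed SEM variant: square root of the error mean square.
    sem = math.sqrt(mse)
    # Assumed minimal difference for a 95% confidence level.
    minimal_difference = sem * 1.96 * math.sqrt(2)
    # F ratio for the trials effect; a large value hints at systematic error.
    f_trials = mst / mse if mse > 0 else math.inf
    return {
        "icc_2_1": icc_2_1,
        "icc_3_1": icc_3_1,
        "sem": sem,
        "minimal_difference": minimal_difference,
        "f_trials": f_trials,
    }


if __name__ == "__main__":
    # Hypothetical test-retest scores: 5 subjects, 2 trials each.
    demo_scores = [[146, 140], [148, 152], [170, 152], [90, 99], [157, 145]]
    for name, value in reliability_summary(demo_scores).items():
        print(f"{name}: {value:.3f}")
```

When every subject shifts by the same amount between trials, ICC(3,1) in this sketch stays at 1 while ICC(2,1) falls below 1, which reflects the distinction the abstract draws between excluding and including systematic error in the denominator.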
