Using the bootstrap to establish statistical significance for relative validity comparisons among patient-reported outcome measures

Background: Relative validity (RV), a ratio of ANOVA F-statistics, is often used to compare the validity of patient-reported outcome (PRO) measures. We used the bootstrap to establish the statistical significance of the RV and to identify key factors affecting its significance.

Methods: Based on responses from 453 chronic kidney disease (CKD) patients to 16 CKD-specific and generic PRO measures, RVs were computed to determine how well each measure discriminated across clinically defined groups of patients compared to the most discriminating (reference) measure. Statistical significance of the RV was quantified by the 95% bootstrap confidence interval. Simulations examined the effects of sample size, denominator F-statistic, correlation between comparator and reference measures, and number of bootstrap replicates.

Results: The statistical significance of the RV increased as the magnitude of the denominator F-statistic increased or as the correlation between comparator and reference measures increased. A denominator F-statistic of 57 conveyed sufficient power (80%) to detect an RV of 0.6 for two measures correlated at r = 0.7. Larger denominator F-statistics or higher correlations provided greater power. A larger sample size with a fixed denominator F-statistic, or more bootstrap replicates (beyond 500), had minimal impact.

Conclusions: The bootstrap is valuable for establishing the statistical significance of RV estimates. A reasonably large denominator F-statistic (F > 57) is required for adequate power when using the RV to compare the validity of measures with small or moderate correlations (r < 0.7). Substantially greater power can be achieved when comparing measures with a very high correlation (r > 0.9).
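As a rough illustration of the procedure summarized above, the following sketch computes an RV as the ratio of two one-way ANOVA F-statistics and attaches a 95% bootstrap confidence interval by resampling patients with replacement. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which type of bootstrap interval was used, so a simple percentile interval is assumed here, and the function names (`f_statistic`, `relative_validity`, `bootstrap_rv_ci`) and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def f_statistic(scores, groups):
    """One-way ANOVA F-statistic of scores across group labels."""
    by_group = defaultdict(list)
    for score, label in zip(scores, groups):
        by_group[label].append(score)
    n, k = len(scores), len(by_group)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    grand_mean = sum(scores) / n
    means = {label: sum(v) / len(v) for label, v in by_group.items()}
    ss_between = sum(len(v) * (means[label] - grand_mean) ** 2
                     for label, v in by_group.items())
    ss_within = sum((x - means[label]) ** 2
                    for label, v in by_group.items() for x in v)
    if ss_within == 0:
        raise ValueError("zero within-group variance")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def relative_validity(comparator, reference, groups):
    """RV = F(comparator) / F(reference); the reference F is the denominator."""
    f_ref = f_statistic(reference, groups)
    if f_ref == 0:
        raise ValueError("reference F-statistic is zero")
    return f_statistic(comparator, groups) / f_ref


def bootstrap_rv_ci(comparator, reference, groups, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the RV (interval type is an assumption).

    Patients are resampled as whole rows, so the correlation between the
    comparator and reference scores is preserved in every replicate.
    """
    rng = random.Random(seed)
    n = len(groups)
    rvs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            rvs.append(relative_validity([comparator[i] for i in idx],
                                         [reference[i] for i in idx],
                                         [groups[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples (e.g. only one group drawn)
    rvs.sort()
    lower = rvs[int((alpha / 2) * len(rvs))]
    upper = rvs[min(len(rvs) - 1, int((1 - alpha / 2) * len(rvs)))]
    return lower, upper


# Toy data (hypothetical): three clinical groups and two correlated measures,
# with the comparator separating the groups less sharply than the reference.
_rng = random.Random(1)
groups = [i % 3 for i in range(150)]
reference, comparator = [], []
for g in groups:
    shared = _rng.gauss(0, 1)  # shared noise induces correlation
    reference.append(g + shared)
    comparator.append(0.6 * g + shared + _rng.gauss(0, 0.5))

rv = relative_validity(comparator, reference, groups)
ci_low, ci_high = bootstrap_rv_ci(comparator, reference, groups, n_boot=500)
print(f"RV = {rv:.2f}, 95% bootstrap CI = ({ci_low:.2f}, {ci_high:.2f})")
```

One common reading, assumed here, is that a comparator is significantly less valid than the reference when the upper bound of the interval falls below 1. Resampling whole patients rather than each measure separately matters because, as the results above indicate, the correlation between the two measures strongly affects the width of the interval.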
