Improving prediction model systematic review methodology: Letter to the Editor

Dear Editor,

In their recently published paper, Seow et al1 carried out a systematic review of musculoskeletal injury prediction models in professional sport and military special forces. Their review encompassed a comprehensive search that included both conference and published papers, used a standardized musculoskeletal injury definition informed by the literature, and included both statistical and machine learning-based models. Nevertheless, we have a number of concerns regarding the conduct and reporting of some aspects of the study that limit the usefulness of their findings.

Our first point relates to how the studies were appraised. While the authors should be commended for assessing each study for risk of bias, the Newcastle-Ottawa Scale (NOS) is not the correct tool to do this. The NOS is a generic tool designed to assess the quality of non-randomized studies such as case-control and cohort studies, and while prediction model studies often use a cohort design, the tool includes no specific assessment of analysis issues relating to the development or validation of a prediction model. Hence, the NOS is a blunt instrument for assessing risk of bias in these studies. The tool that should have been used to assess risk of bias in the review by Seow et al1 is the Prediction model Risk Of Bias Assessment Tool (PROBAST),2 which includes 20 signaling questions over four domains (participants, predictors, outcome, and analysis) to cover key aspects of prediction model studies. Furthermore, when designing a systematic review of prediction model studies, the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist3 provides detailed guidance to help authors develop their systematic review questions relating to prediction models, extract pertinent prediction model data, and appraise prediction model studies. Had these more relevant tools been used, and the review process outlined by the Cochrane Prognosis Methods Group followed,4 the authors would have been better able to appraise and utilize the prediction model studies included in their review. In particular, this would have given more depth and clarity, allowed enhanced identification of any strengths in the existing evidence, and highlighted particular areas of conduct and reporting that should be improved upon in future studies.

While the authors extracted and reported the discrimination performance (such as the area under the curve) of the included models, we note that there was no comment on model calibration, an essential component of model performance.4,5 Calibration is the agreement between the probabilities derived from the model and those actually observed within the data,6 and is important for understanding the accuracy of the predictions from the model.7,8 This omission could have been addressed at the design stage using the aforementioned CHARMS checklist. Consequently, the authors have missed an important opportunity to report on this critical aspect of prediction model performance assessment and have therefore presented readers with incomplete information on the usefulness of the included prediction models. Furthermore, any omission of calibration in the primary studies will have a direct and negative impact on the risk of bias assessment.
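For readers less familiar with calibration, the following is a minimal sketch, under our own illustrative assumptions rather than anything reported by Seow et al1, of how two common calibration summaries (the calibration slope and calibration-in-the-large) could be computed for a binary outcome in Python; the names calibration_summary, y_obs (observed outcomes), and p_pred (model-predicted risks) are hypothetical.

    import numpy as np
    import statsmodels.api as sm

    def calibration_summary(y_obs, p_pred, eps=1e-10):
        # Work on the log-odds (logit) scale of the predicted risks.
        p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
        logit_p = np.log(p / (1 - p))
        # Calibration slope: logistic regression of the observed outcomes
        # on the model's linear predictor (ideal value is 1).
        slope_fit = sm.GLM(y_obs, sm.add_constant(logit_p),
                           family=sm.families.Binomial()).fit()
        # Calibration-in-the-large: intercept-only model with the linear
        # predictor entered as an offset (ideal value is 0).
        citl_fit = sm.GLM(y_obs, np.ones((len(logit_p), 1)),
                          family=sm.families.Binomial(),
                          offset=logit_p).fit()
        return {"calibration_slope": slope_fit.params[1],
                "calibration_in_the_large": citl_fit.params[0]}

A slope well below 1 or an intercept far from 0 would indicate over-fitting or a mis-calibrated baseline risk, information that discrimination measures such as the area under the curve cannot convey on their own.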
A related concern is that the authors do not explain how they extracted performance estimates, and whether they used the extensive tools provided by Debray et al9 to help derive estimates (eg, the area under the curve and its confidence interval) when these were not reported directly, in order to maximize the information available for review. Whether performance statistics were adjusted for optimism was also not reported,10 and clinical utility measures (eg, net benefit11) were not discussed.

We were also concerned by the authors' expectations regarding the handling of class imbalance using over- or undersampling to create a more balanced data set. Data are said to be imbalanced when there are fewer individuals in the data set with the outcome than without it. In the context of classification, this can indeed be a problem, for example, when evaluating classification accuracy (ie, the proportion of correct classifications), in the sense that misclassifying individuals with the outcome in a highly imbalanced data set can still yield high accuracy, as the larger non-outcome group will dominate the calculation of overall accuracy.12 However, in the context of prediction (the aim of the review by Seow et al1), class imbalance is a feature of the data that reflects the true prevalence of the outcome in the target population; artificially rebalancing the data through over- or undersampling distorts that prevalence and risks producing miscalibrated predicted probabilities.
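To illustrate the classification point above, and not as a reflection of any analysis in Seow et al1, the short sketch below uses hypothetical numbers (a cohort of 10,000 athletes with a 5% injury prevalence) to show how a rule that never predicts an injury still attains high overall accuracy.

    import numpy as np

    rng = np.random.default_rng(0)
    n, prevalence = 10_000, 0.05           # hypothetical cohort size and injury rate
    y = rng.binomial(1, prevalence, n)     # observed injury outcomes (0/1)

    # A "model" that simply classifies every athlete as uninjured.
    y_hat = np.zeros(n, dtype=int)

    accuracy = (y_hat == y).mean()         # roughly 0.95, despite being useless
    sensitivity = y_hat[y == 1].mean()     # 0.0: no injured athlete is identified
    print(accuracy, sensitivity)

High accuracy here is an artifact of the imbalance rather than of predictive value; for risk prediction, the more informative question is whether the estimated probabilities are well calibrated at the outcome's natural prevalence, which a data set resampled to a 50:50 balance would no longer reflect.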

[1] Martha Sajatovic, et al. Clinical Prediction Models, 2013.

[2] N. Obuchowski, et al. Assessing the Performance of Prediction Models: A Framework for Traditional and Novel Measures, 2010, Epidemiology.

[3] L. Hooft, et al. A guide to systematic review and meta-analysis of prediction model performance, 2017, British Medical Journal.

[4] Richard D. Riley, et al. A framework for meta-analysis of prediction model studies with binary and time-to-event outcomes, 2018, Statistical Methods in Medical Research.

[5] Nitesh V. Chawla, et al. SMOTE: Synthetic Minority Over-sampling Technique, 2002, Journal of Artificial Intelligence Research.

[6] G. Collins, et al. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies, 2018.

[7] Maarten van Smeden, et al. Sample size for binary logistic prediction models: Beyond events per variable criteria, 2018, Statistical Methods in Medical Research.

[8] Richard D. Riley, et al. Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes, 2018, Statistics in Medicine.

[9] G. Collins, et al. Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies: The CHARMS Checklist, 2014, PLoS Medicine.

[10] Dexter Seow, et al. Prediction models for musculoskeletal injuries in professional sporting activities: A systematic review, 2020.

[11] Gary S. Collins, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration, 2015, Annals of Internal Medicine.

[12] Richard D. Riley, et al. Penalization and shrinkage methods produced unreliable clinical prediction models especially when sample size was small, 2020, Journal of Clinical Epidemiology.

[13] G. Collins, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting, 2014, BMC Medical Research Methodology.

[14] Maarten van Smeden, et al. Calibration: the Achilles heel of predictive analytics, 2019, BMC Medicine.

[15] M. Kenward, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls, 2009, BMJ.

[16] D. Rao, et al. A systematic review of multi-level stigma interventions: state of the science and future directions, 2019, BMC Medicine.

[17] Ewout W. Steyerberg, et al. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests, 2016, British Medical Journal.