External Validation of SpineNet, an Open-Source Deep Learning Model for Grading Lumbar Disk Degeneration MRI Features, Using the Northern Finland Birth Cohort 1966

Study Design. This is a retrospective observational study to externally validate a deep learning image classification model.

Objective. Deep learning models such as SpineNet offer the possibility of automating the classification of disk degeneration (DD) from magnetic resonance imaging (MRI). External validation is an essential step in their development. The aim of this study was to externally validate SpineNet predictions for DD using Pfirrmann classification and Modic changes (MCs) on data from the Northern Finland Birth Cohort 1966 (NFBC1966).

Summary of Data. We validated SpineNet using data from 1331 NFBC1966 participants for whom both lumbar spine MRI data and consensus DD gradings were available.

Materials and Methods. SpineNet returned Pfirrmann grade and MC presence from T2-weighted sagittal lumbar MRI sequences from NFBC1966, a data set geographically and temporally separated from its training data set. A range of agreement and reliability metrics was used to compare its predictions with the gradings of expert radiologists. Subsets of data that match SpineNet training data more closely were also tested.

Results. Balanced accuracy for DD was 78% (77%–79%) and for MC 86% (85%–86%). Interrater reliability for Pfirrmann grading was Lin concordance correlation coefficient=0.86 (0.85–0.87) and Cohen κ=0.68 (0.67–0.69). In a low back pain subset, these reliability metrics remained largely unchanged. In total, 20.83% of disks were rated differently by SpineNet compared with the human raters, but only 0.85% of disks had a grade difference >1. Interrater reliability for MC detection was κ=0.74 (0.72–0.75). In the low back pain subset, this metric was almost unchanged at κ=0.76 (0.73–0.79).

Conclusions. In this study, SpineNet has been benchmarked against expert human raters in the research setting. It has matched human reliability and demonstrates robust performance despite the multiple challenges facing model generalizability.
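For readers unfamiliar with the agreement metrics reported in the Results, the following is a minimal sketch of how balanced accuracy, Cohen κ, the Lin concordance correlation coefficient, and the grade-difference rates can be computed from paired human and model gradings. It is illustrative only and is not the study's analysis code. The function names and the toy gradings are hypothetical, the κ shown is the unweighted form, and the abstract does not state whether a weighted κ was used or how the confidence intervals were obtained.

```python
from collections import Counter
from statistics import fmean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return fmean(hits[c] / totals[c] for c in totals)


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the two raters' marginal class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (p_o - p_e) / (1 - p_e)


def lin_ccc(x, y):
    """Lin concordance correlation coefficient using population (1/n) moments."""
    mx, my = fmean(x), fmean(y)
    var_x = fmean((a - mx) ** 2 for a in x)
    var_y = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def grade_difference_rates(y_true, y_pred):
    """Fraction of disks graded differently, and fraction differing by more than one grade."""
    n = len(y_true)
    any_diff = sum(t != p for t, p in zip(y_true, y_pred)) / n
    big_diff = sum(abs(t - p) > 1 for t, p in zip(y_true, y_pred)) / n
    return any_diff, big_diff


if __name__ == "__main__":
    # Hypothetical Pfirrmann grades (1-5) for eight disks; not data from the study.
    human = [1, 2, 2, 3, 3, 4, 4, 5]
    model = [1, 2, 3, 3, 3, 4, 5, 5]
    print("balanced accuracy:", balanced_accuracy(human, model))
    print("Cohen kappa:", cohen_kappa(human, model))
    print("Lin CCC:", lin_ccc(human, model))
    print("any / >1 grade difference:", grade_difference_rates(human, model))
```

Balanced accuracy averages recall across grades, so rare grades count as much as common ones. The concordance coefficient treats the grades as ordered numbers and penalizes systematic shifts between raters, which is why it can be high even when exact-match agreement (κ) is moderate.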
