In this issue of npj Digital Medicine, Abramoff and colleagues report the findings from a prospective study that evaluates the performance of a diabetic retinopathy diagnostic system (IDx-DR) in a primary care setting. This represents an important clinical milestone: in April 2018, these results formed the basis for FDA approval of the system, making it the first fully autonomous AI-based system approved for marketing in the USA. Given the transformative potential of AI for healthcare (in particular a technique referred to as “deep learning”), but also its associated hype, this lays an important foundation for future translation of such technologies to routine clinical practice.

Deep learning uses artificial neural networks, so called because of their superficial resemblance to biological neural networks, as computational models to discover intricate structure in large, high-dimensional datasets. Although first espoused in the 1980s, deep learning has come to prominence in recent years, driven in large part by the power of graphics processing units (GPUs) originally developed for video gaming, by cloud computing, and by the increasing availability of large, carefully annotated datasets. Since 2012, deep learning has brought seismic changes to the technology industry, with major breakthroughs in areas as diverse as image and speech recognition, natural language translation, robotics, and even self-driving cars. In 2015, Scientific American listed deep learning as one of its ‘world changing’ ideas for the year.

Deep learning is particularly well suited to image classification tasks and so has huge potential in medical imaging applications: scans, slides, skin lesions, and other patterns that occur frequently in medical practice and are associated with screening, triage, diagnosis, and monitoring. A number of recent research studies have demonstrated this potential in multiple domains, albeit in retrospective in silico settings.
The work reported by Abramoff et al. is an important milestone as the first of its kind to be performed in a prospective real-world clinical environment, using a product that will be commercially available rather than a research prototype. The need for external validation studies is well recognized in the machine learning community; however, there may be less awareness of the additional specific value provided by a prospective clinical study, as well as the time, effort, and considerable costs that such studies entail. Prospective, noninterventional studies, such as that described by Abramoff and colleagues, will likely be fundamental to addressing questions about the efficacy of automated diagnosis.

However, such studies will not address the issue of clinical effectiveness: do patients directly benefit from the use of such AI systems? In the case of diabetic retinopathy, the question might be: do patients ultimately have good, or at least non-inferior, visual outcomes when this system is used? This is not a trivial point. Computer-aided detection (CAD) systems for mammography were approved by the FDA in 1998, and by 2008, 74% of all screening mammograms in the Medicare population were interpreted using this technology. However, nearly 20 years later a large study concluded “CAD does not improve diagnostic accuracy of mammography and may result in missed cancers. These results suggest that insurers pay more for computer-aided detection with no established benefit to women.” To properly address this issue, prospective interventional studies should be required. Of course, such randomized clinical trials may not be feasible or warranted in every case; however, it will be incumbent on the clinical community to engage with this question. A further important point is that, historically, diagnostic accuracy studies have often been suboptimally or poorly reported.
With the likely further clinical translation of AI systems, it will become increasingly important for STARD (Standards for Reporting of Diagnostic Accuracy Studies), and other trial reporting guidelines, to be both followed and regularly updated.

The clinical research community also has blind spots. In particular, there is a lack of awareness of the so-called 'AI Chasm', that is, the gulf between developing a scientifically sound algorithm and its use in any meaningful real-world application. It is one thing to develop an algorithm that works well on a small dataset from a specific population; it is quite another to develop one that will generalize to other populations and across different imaging modalities. There is also a large gulf between the experimental code produced for a proof-of-concept research study and the eventual code to be used in a product with regulatory approvals. The latter constitutes a medical device and so must typically be rewritten from the ground up, with a quality management system in place, and in compliance with Good Manufacturing Practice. The time, expertise, and expense associated with this can be considerable, and the work is likely not feasible for clinicians without an industry partner or other significant commercial support.

It is also important to highlight that many aspects of the regulatory processes for AI are still evolving and that there is uncertainty about the implications of this, both for planning of clinical trials and commercial development. Firstly, it is worth explicitly pointing out a prevalent misconception about AI diagnostic systems. Although these systems typically learn by being trained on large amounts of labelled images, at some point this process is stopped and diagnostic thresholds are set. In the work by Abramoff and colleagues, the software was locked prior to the clinical trial; after this point, the software behaves in a similar fashion to non-AI diagnostic systems. That is to say, the auto-didactic aspect of the algorithm is frozen, and it is no longer doing ‘on the job’ learning.
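The idea of a "locked" system can be made concrete with a small sketch. The code below is purely illustrative and is not the IDx-DR implementation: the class name `LockedScreeningModel`, the toy logistic scoring function, and all numeric values are hypothetical. It shows only the general pattern described above, in which parameters and the decision threshold are fixed before deployment, so the same input always yields the same output.

```python
import math
from dataclasses import dataclass, FrozenInstanceError
from typing import Sequence, Tuple


@dataclass(frozen=True)
class LockedScreeningModel:
    """Hypothetical locked classifier: parameters are immutable after release."""

    weights: Tuple[float, ...]  # learned during training, then frozen
    bias: float                 # learned during training, then frozen
    threshold: float            # diagnostic threshold set before the trial

    def score(self, features: Sequence[float]) -> float:
        """Return a probability-like score from a toy logistic model."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def refer(self, features: Sequence[float]) -> bool:
        """Apply the fixed threshold; nothing is updated by this call."""
        return self.score(features) >= self.threshold


# Illustrative values only (assumptions, not taken from any real system).
model = LockedScreeningModel(weights=(1.5, -0.5), bias=-1.0, threshold=0.5)

print(model.refer([2.0, 1.0]))  # decision for one hypothetical case
print(model.refer([0.0, 0.0]))  # decision for another hypothetical case

try:
    model.threshold = 0.1       # any attempt to change a locked parameter fails
except FrozenInstanceError:
    print("model is locked; threshold cannot be changed")
```

Because no state changes between calls, such a system can be evaluated in a trial like any other fixed diagnostic test; an algorithm that kept updating its weights on each new case would not have this property.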
It may be some years before clinical trial methodologies and regulatory frameworks have evolved to deal with algorithms capable of learning on a case-by-case basis in a real-world setting. Secondly, it is worth highlighting that IDx-DR was reviewed under the FDA’s De Novo premarket review pathway. This is a regulatory pathway for low- to moderate-risk devices that are novel and for which there is no legally marketed device. The bar for subsequent approval of diabetic retinopathy AI diagnostic systems is likely to be higher.

While this study is undoubtedly a milestone, and an important benchmark for future research, it is also important to touch on