Expert Discussions Improve Comprehension of Difficult Cases in Medical Image Assessment

Medical data labeling workflows critically depend on accurate assessments from human experts. Yet human assessments can vary markedly, even among medical experts. Prior research has demonstrated that labeler training improves performance. Here we used two types of labeler training feedback: highlighting incorrect labels for difficult cases ("individual performance" feedback), and expert discussions from the adjudication of these cases. We presented ten generalist eye care professionals with either individual performance feedback alone, or individual performance feedback together with expert discussions from specialists. Compared to performance feedback alone, seeing expert discussions significantly improved generalists' understanding of the rationale behind the correct diagnosis and motivated changes in their own labeling approach; it also significantly improved average accuracy on one of four pathologies in a held-out test set. This work suggests that image adjudication may provide benefits beyond developing trusted consensus labels, and that exposure to specialist discussions can be an effective training intervention for medical diagnosis.
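As a rough illustration of the kind of comparison the abstract describes, the sketch below computes per-grader accuracy for one pathology and compares the two feedback conditions with a permutation test. The grader counts, accuracy values, and the choice of a permutation test are illustrative assumptions, not the study's actual data or analysis.

```python
# Minimal sketch (not the paper's analysis code): comparing per-pathology labeling
# accuracy between two hypothetical feedback groups on a held-out test set.
# Data values and the permutation-test approach are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def accuracy(labels, truth):
    """Fraction of cases where a grader's label matches the reference label."""
    return np.mean(np.asarray(labels) == np.asarray(truth))

def permutation_test(acc_a, acc_b, n_perm=10_000):
    """Two-sided permutation test on the difference in mean accuracy
    between two groups of graders (acc_a, acc_b: per-grader accuracies)."""
    acc_a, acc_b = np.asarray(acc_a, float), np.asarray(acc_b, float)
    observed = acc_a.mean() - acc_b.mean()
    pooled = np.concatenate([acc_a, acc_b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffle grader accuracies across the two conditions
        diff = pooled[: len(acc_a)].mean() - pooled[len(acc_a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_perm

# Hypothetical per-grader accuracies for one pathology (5 graders per condition).
discussion_group = [0.82, 0.78, 0.85, 0.80, 0.79]   # performance feedback + expert discussions
performance_only = [0.71, 0.74, 0.69, 0.75, 0.72]   # performance feedback alone

diff, p = permutation_test(discussion_group, performance_only)
print(f"mean accuracy difference = {diff:.3f}, permutation p-value = {p:.4f}")
```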
