Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data

The Large Scale Visual Recognition Challenge, based on the well-known ImageNet dataset, catalyzed an intense flurry of progress in computer vision. Benchmark tasks have propelled other sub-fields of machine learning forward at an equally impressive pace, but in healthcare it has primarily been image processing tasks, such as those in dermatology and radiology, that have experienced similar benchmark-driven progress. In the present study, we performed a comprehensive review of benchmarks in medical machine learning for structured data and identified one, based on the Medical Information Mart for Intensive Care (MIMIC-III), that allows the first direct comparison of predictive performance and thus the evaluation of progress on four clinical prediction tasks: mortality, length of stay, phenotyping, and patient decompensation. We find that little meaningful progress has been made on these tasks over a three-year period, despite significant community engagement. Through our meta-analysis, we find that deep recurrent models outperform logistic regression only on certain tasks. We conclude with a synthesis of these results, possible explanations, and a list of desirable qualities for future benchmarks in medical machine learning.
