MIMIC-Extract: a data extraction, preprocessing, and representation pipeline for MIMIC-III

Researchers in machine learning for healthcare face obstacles to progress and reproducibility because public datasets lack standardized processing frameworks. We present MIMIC-Extract, an open-source pipeline that transforms the raw electronic health record (EHR) data of critical care patients in the publicly available MIMIC-III database into data structures that are directly usable in common time-series prediction pipelines. MIMIC-Extract addresses three challenges in making complex EHR data accessible to the broader machine learning community. First, it transforms raw vital sign and laboratory measurements into usable hourly time series, performing essential steps such as unit conversion, outlier handling, and aggregation of semantically similar features to reduce missingness and improve robustness. Second, it extracts clinically relevant prediction targets, including outcomes such as mortality and length of stay as well as comprehensive hourly intervention signals for ventilators, vasopressors, and fluid therapies. Finally, the pipeline emphasizes reproducibility and extensibility to future research questions. We demonstrate the pipeline's effectiveness by developing several benchmark tasks for outcome and intervention forecasting and assessing the performance of competitive models on them.
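The first step described above (unit conversion, outlier handling, aggregation of semantically similar features, and hourly bucketing) can be illustrated with a minimal sketch. This is not MIMIC-Extract's actual code: the raw labels, the canonical feature names, the plausibility ranges, and the function `hourly_series` are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical mapping: raw label -> (canonical feature, unit converter).
# Two differently labeled temperature measurements collapse into one feature.
CANONICAL = {
    "Temperature F": ("temperature_c", lambda f: (f - 32.0) * 5.0 / 9.0),
    "Temperature C": ("temperature_c", lambda c: c),
    "Heart Rate": ("heart_rate", lambda x: x),
}

# Hypothetical plausibility ranges; values outside them are discarded here.
VALID_RANGE = {
    "temperature_c": (25.0, 45.0),
    "heart_rate": (0.0, 350.0),
}


def hourly_series(events):
    """Aggregate (hours_since_admission, raw_label, value) events by hour.

    Returns a dict keyed by (hour, canonical_feature) holding the mean of
    all plausible values observed in that hour.
    """
    buckets = defaultdict(list)
    for t, label, value in events:
        if label not in CANONICAL:
            continue  # unknown measurement type: skip it
        feature, convert = CANONICAL[label]
        converted = convert(value)  # unit conversion
        low, high = VALID_RANGE[feature]
        if not low <= converted <= high:
            continue  # outlier handling: drop implausible values
        buckets[(math.floor(t), feature)].append(converted)
    return {key: mean(values) for key, values in buckets.items()}


events = [
    (0.2, "Temperature F", 98.6),
    (0.7, "Temperature C", 38.0),
    (0.5, "Heart Rate", 80.0),
    (1.1, "Heart Rate", 900.0),  # implausible, so it is dropped
    (1.4, "Heart Rate", 90.0),
]
series = hourly_series(events)
```

In this toy input, the two temperature readings in hour 0 arrive under different labels and units but end up averaged in a single `temperature_c` cell, which is the sense in which grouping similar features reduces missingness.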
