MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports

Chest radiography is an extremely powerful imaging modality that allows detailed inspection of a patient’s chest, but it requires specialized training for proper interpretation. With the advent of high-performance, general-purpose computer vision algorithms, accurate automated analysis of chest radiographs is of growing interest to researchers. Here we describe MIMIC-CXR, a large dataset of 227,835 imaging studies for 65,379 patients presenting to the Beth Israel Deaconess Medical Center Emergency Department between 2011 and 2016. Each imaging study contains one or more images, usually a frontal view and a lateral view, for a total of 377,110 images in the dataset. Each study is accompanied by a semi-structured free-text radiology report describing the radiological findings in the images, written by a practicing radiologist during routine clinical care. All images and reports have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in computer vision, natural language processing, and clinical data mining.
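As a rough illustration of the dataset's scale, the counts quoted above imply the following averages. This is a minimal sketch that uses only the three totals stated in the abstract; the variable names are illustrative and are not part of the dataset or its tooling.

```python
# Totals as stated in the abstract.
n_images = 377_110
n_studies = 227_835
n_patients = 65_379

# Derived averages (simple ratios, not figures reported by the authors).
images_per_study = n_images / n_studies
studies_per_patient = n_studies / n_patients

print(f"images per study:    {images_per_study:.2f}")
print(f"studies per patient: {studies_per_patient:.2f}")
```

An average between one and two images per study is consistent with the statement that a study usually contains a frontal view and a lateral view, with some studies containing only a single image.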
