COVID-19-CT-CXR: A Freely Accessible and Weakly Labeled Chest X-Ray and CT Image Collection on COVID-19 From Biomedical Literature

The latest threat to global health is the COVID-19 outbreak. Although large datasets of chest X-rays (CXR) and computed tomography (CT) scans exist, few COVID-19 image collections are currently available because of patient privacy concerns. At the same time, the number of COVID-19-relevant articles in the biomedical literature is growing rapidly, including those that report radiographic findings. Here, we present COVID-19-CT-CXR, a public database of COVID-19 CXR and CT images that are automatically extracted from COVID-19-relevant articles in the PubMed Central Open Access (PMC-OA) Subset. We extracted figures, associated captions, and relevant figure descriptions from each article and separated compound figures into subfigures. Because a large portion of figures in COVID-19 articles are not CXR or CT, we designed a deep-learning model to distinguish them from other figure types and to classify them accordingly. The final database includes 1,327 CT and 263 CXR images (as of May 9, 2020) with their relevant text. To demonstrate the utility of COVID-19-CT-CXR, we conducted four case studies. (1) We show that COVID-19-CT-CXR, when used as additional training data, contributes to improved deep-learning (DL) performance for the classification of COVID-19 and non-COVID-19 CT. (2) We collected CT images of influenza, another common infectious respiratory illness that may present similarly to COVID-19, and fine-tuned a baseline deep neural network to distinguish among COVID-19, influenza, and normal or other types of disease on CT. (3) We fine-tuned an unsupervised one-class classifier from non-COVID-19 CXR and performed anomaly detection to detect COVID-19 CXR. (4) From text-mined captions and figure descriptions, we compared 15 clinical symptoms and 20 clinical findings of COVID-19 versus those of influenza to show how the two diseases differ as reported in scientific publications.
Our database is unique, as the figures are retrieved along with relevant text with fine-grained descriptions, and it can be extended easily in the future. We believe that our work is complementary to existing resources and hope that it will contribute to medical image analysis of the COVID-19 pandemic. The dataset, code, and DL models are publicly available at https://github.com/ncbi-nlp/COVID-19-CT-CXR.
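As a rough illustration of the caption-based comparison in case study (4), the following minimal sketch counts how often selected terms are mentioned in two sets of captions. It is not the authors' pipeline: the function name `mention_rates`, the sample captions, and the term list are all hypothetical, and simple case-insensitive substring matching stands in for whatever text-mining method the database actually uses.

```python
from collections import Counter


def mention_rates(captions, terms):
    """Return the fraction of captions mentioning each term (case-insensitive).

    Hypothetical helper for illustration; plain substring matching is an
    assumption and ignores negation, synonyms, and abbreviations.
    """
    counts = Counter()
    for caption in captions:
        text = caption.lower()
        for term in terms:
            if term in text:
                counts[term] += 1
    total = len(captions)
    return {term: (counts[term] / total if total else 0.0) for term in terms}


# Made-up captions standing in for text-mined figure captions.
covid_captions = [
    "Chest CT shows bilateral ground-glass opacity in a patient with fever.",
    "Axial CT image with ground-glass opacity and consolidation.",
]
flu_captions = [
    "CT shows consolidation in the right lower lobe; patient had fever and cough.",
    "Chest CT with pleural effusion.",
]
# Illustrative terms only; the paper compares 15 symptoms and 20 findings.
terms = ["ground-glass opacity", "consolidation", "pleural effusion", "fever"]

covid_rates = mention_rates(covid_captions, terms)
flu_rates = mention_rates(flu_captions, terms)

for term in terms:
    print(f"{term}: COVID-19 {covid_rates[term]:.2f} vs influenza {flu_rates[term]:.2f}")
```

With real data, the per-term counts from each disease group could then be fed into a statistical test of the reader's choice to judge whether an observed difference is meaningful.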
