COVID-19 diagnosis by routine blood tests using machine learning

Physicians taking care of patients with coronavirus disease (COVID-19) have described different changes in routine blood parameters. However, these changes hinder them from performing COVID-19 diagnosis. We constructed a machine learning predictive model for COVID-19 diagnosis. The model was built and cross-validated on the routine blood tests of 5,333 patients with various bacterial and viral infections, and 160 COVID-19-positive patients. We selected an operational ROC point at a sensitivity of 81.9% and a specificity of 97.9%. The cross-validated area under the curve (AUC) was 0.97. The five most useful routine blood parameters for COVID-19 diagnosis according to the feature importance scoring of the XGBoost algorithm were MCHC, eosinophil count, albumin, INR, and prothrombin activity percentage. t-SNE visualization showed that the blood parameters of patients with a severe COVID-19 course are more like the parameters of bacterial than of viral infection. The reported diagnostic accuracy is at least comparable and probably complementary to RT-PCR and chest CT studies. Patients with fever, cough, myalgia, and other symptoms can now have initial routine blood tests assessed by our diagnostic tool. All patients with a positive COVID-19 prediction would then undergo standard RT-PCR studies to confirm the diagnosis. We believe that our results present a significant contribution to improvements in COVID-19 diagnosis.
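To make the notion of an operational ROC point concrete, the following is a minimal sketch of how a decision threshold might be chosen from classifier scores: among thresholds that meet a required specificity, it keeps the one with the highest sensitivity. The function names, the selection rule, and the toy data are illustrative assumptions only; they are not the study's code, data, or actual selection procedure.

```python
# Illustrative sketch only: hypothetical helper names and toy data,
# not the procedure or data used in the study.

def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when predicting positive if score >= threshold.

    Assumes labels are 0/1 and that both classes are present.
    """
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def pick_operating_point(labels, scores, min_specificity):
    """Among thresholds meeting min_specificity, return the one with highest sensitivity.

    Returns (threshold, sensitivity, specificity), or None if no threshold qualifies.
    Ties in sensitivity keep the lowest qualifying threshold.
    """
    best = None
    for threshold in sorted(set(scores)):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        if spec >= min_specificity and (best is None or sens > best[1]):
            best = (threshold, sens, spec)
    return best


# Toy example: 4 positive and 6 negative cases with made-up scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1]

point = pick_operating_point(labels, scores, min_specificity=0.8)
print(point)
```

Favoring specificity in this way fits a screening workflow in which positive predictions are subsequently confirmed by RT-PCR, though other selection criteria are equally plausible.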
