An Overview of Healthcare Data Analytics With Applications to the COVID-19 Pandemic

Abstract—In the era of big data, standard analysis tools may be inadequate for making inferences, and there is a growing need for more efficient and innovative ways to collect, process, analyze and interpret massive and complex data. We provide an overview of challenges in big data problems and describe how innovative analytical methods, machine learning tools and metaheuristics can tackle general healthcare problems, with a focus on the current pandemic. In particular, we give applications of modern digital technology, statistical methods, data platforms and data integration systems to improve diagnosis and treatment of diseases in clinical research, and novel epidemiologic tools to tackle infection source problems, such as finding Patient Zero in the spread of epidemics. We make the case that analyzing and interpreting big data is a very challenging task that requires a multi-disciplinary effort to continuously create more effective methodologies and powerful tools to transform data into knowledge that enables informed decision making.
