Ensemble machine learning methods in screening electronic health records: A scoping review

Background: Electronic health records make it possible to use machine learning techniques to identify undiagnosed individuals who are likely to have a given disease and who could benefit from further medical screening and case finding. Targeting these individuals reduces the number needed to screen, which improves convenience and lowers healthcare costs. Ensemble machine learning models, which combine multiple prediction estimates into one, are often said to provide better predictive performance than non-ensemble models. Yet, to our knowledge, no literature review summarises the use and performance of different types of ensemble machine learning models in the context of medical pre-screening.

Method: We conducted a scoping review of the literature reporting the derivation of ensemble machine learning models for screening electronic health records. We searched the EMBASE and MEDLINE databases across all years, applying a formal search strategy with terms related to medical screening, electronic health records and machine learning. Data were collected, analysed, and reported in accordance with the PRISMA scoping review guideline.

Results: A total of 3355 articles were retrieved, of which 145 met our inclusion criteria and were included in this study. Ensemble machine learning models were increasingly employed across several medical specialties and often outperformed non-ensemble approaches. Ensemble models with complex combination strategies and heterogeneous classifiers often outperformed other types of ensemble models but were used less frequently. The methodologies, processing steps and data sources of ensemble machine learning models were often not clearly described.

Conclusions: Our work highlights the importance of deriving and comparing the performance of different types of ensemble machine learning models when screening electronic health records, and underscores the need for more comprehensive reporting of the machine learning methodologies employed in clinical research.
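To make concrete what "combining multiple prediction estimates into one" can mean, the following is a minimal sketch of two simple combination strategies, probability averaging and majority voting. It is illustrative only and is not taken from any of the reviewed studies: the function names, the 0.5 threshold and the probability values are all hypothetical.

```python
from statistics import mean


def average_probabilities(prob_lists):
    """Combine per-model probabilities for each record by unweighted averaging."""
    return [mean(record_probs) for record_probs in zip(*prob_lists)]


def majority_vote(prob_lists, threshold=0.5):
    """Each model votes 1 when its probability reaches the threshold.

    Returns 1 for a record when more than half of the models vote 1.
    The threshold of 0.5 is an assumption made for illustration.
    """
    votes = [[1 if p >= threshold else 0 for p in probs] for probs in prob_lists]
    return [
        1 if 2 * sum(record_votes) > len(record_votes) else 0
        for record_votes in zip(*votes)
    ]


# Hypothetical predicted disease probabilities from three base models
# for the same four patient records (made-up numbers).
model_a = [0.9, 0.2, 0.6, 0.4]
model_b = [0.8, 0.1, 0.4, 0.7]
model_c = [0.7, 0.3, 0.7, 0.3]

averaged = average_probabilities([model_a, model_b, model_c])
flags = majority_vote([model_a, model_b, model_c])

print("Averaged probabilities:", [round(p, 3) for p in averaged])
print("Majority-vote screening flags:", flags)
```

More complex combination strategies, such as stacking, would replace these fixed rules with a second model trained on the base models' outputs.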
