A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments

Machine learning (ML) offers a collection of powerful approaches for detecting and modeling associations, and is often applied to data with a large number of features and/or complex associations. Many tools now facilitate implementing custom ML analyses (e.g., scikit-learn). Interest is also growing in automated ML packages, which can make it easier for non-experts to apply ML and have the potential to improve model performance. ML now permeates most subfields of biomedical research, with varying levels of rigor and correct usage. The tremendous opportunities offered by ML are frequently offset by the challenge of assembling comprehensive analysis pipelines and by the ease of ML misuse. In this work, we lay out and assemble a complete, rigorous ML analysis pipeline focused on binary classification (i.e., case/control prediction), and apply it to both simulated and real-world data. At a high level, this 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms, each with hyperparameter optimization, and e) thorough evaluation, including appropriate metrics, statistical analyses, and novel visualizations. The pipeline organizes the many subtle complexities of ML pipeline assembly to illustrate best practices for avoiding bias and ensuring reproducibility. Additionally, this pipeline is the first to compare established ML algorithms to 'ExSTraCS', a rule-based ML algorithm with the unique capability of interpretably modeling heterogeneous patterns of association. Although designed to be widely applicable, the pipeline is applied here to an epidemiological investigation of established and newly identified risk factors for pancreatic cancer, in order to evaluate how different sources of bias might be handled by ML algorithms.
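As a rough illustration of stages b) through e), the following minimal sketch uses scikit-learn, the library named above. It is not the authors' pipeline. The synthetic dataset, median imputation, the ANOVA F-test feature filter, the single logistic regression model, the small grid over C, and balanced accuracy as the metric are all assumptions chosen for brevity. The one point it aims to show is that imputation, scaling, and feature selection are fit inside each training fold, and that hyperparameters are tuned in an inner cross-validation loop, so the outer-fold scores are not contaminated by the held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a case/control dataset (an assumption, not study data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject roughly 5% missing values

# Every preprocessing step lives inside the Pipeline, so each step is
# re-fit on the training portion of every fold and never sees test rows.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # b) cleaning
    ("scale", StandardScaler()),                    # b) transformation
    ("select", SelectKBest(f_classif, k=8)),        # c) feature selection
    ("model", LogisticRegression(max_iter=1000)),   # d) one example algorithm
])

# Inner loop tunes hyperparameters; outer loop estimates performance.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="balanced_accuracy",
    cv=inner,
)

# e) evaluation: one balanced-accuracy score per outer fold.
scores = cross_val_score(search, X, y, scoring="balanced_accuracy", cv=outer)
print("balanced accuracy per fold:", np.round(scores, 3))
print("mean:", round(float(scores.mean()), 3))
```

Replacing the "model" step and its parameter grid is enough to compare other algorithms under the same partitions, which is the kind of like-for-like comparison the abstract describes.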
