Predictive overfitting in immunological applications: Pitfalls and solutions

ABSTRACT Overfitting describes the phenomenon in which a model that is highly predictive on the training data generalizes poorly to future observations. It is a common concern when applying machine learning techniques to contemporary medical applications, such as predicting vaccination response and disease status in infectious disease or cancer studies. This review examines the causes of overfitting and offers strategies to counteract it, focusing on model complexity reduction, reliable model evaluation, and harnessing data diversity. By discussing the underlying mathematical models and presenting illustrative examples based on both synthetic data and published real datasets, we aim to equip analysts and bioinformaticians with the knowledge and tools necessary to detect and mitigate overfitting in their research.
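The gap between training and future performance described above can be illustrated with a small synthetic sketch. It is not taken from the review: the sample sizes, the sparse signal, the noise level, and the helper names (`simulate`, `fit_least_squares`, `mse`) are all assumptions chosen for illustration. An unregularized least-squares fit with far more features than samples reproduces the training data almost exactly, while held-out and cross-validated errors stay large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: many more features than training samples.
n_train, n_test, p = 40, 1000, 200
beta = np.zeros(p)
beta[:5] = 1.0  # only 5 features carry signal (illustrative choice)


def simulate(n):
    """Draw n samples from a linear model with unit-variance noise."""
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    return X, y


def fit_least_squares(X, y):
    """Minimum-norm least-squares fit; it interpolates the data when p > n."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mse(X, y, coef):
    """Mean squared prediction error."""
    return float(np.mean((y - X @ coef) ** 2))


X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

coef = fit_least_squares(X_train, y_train)
train_mse = mse(X_train, y_train, coef)
test_mse = mse(X_test, y_test, coef)

# 5-fold cross-validation on the training set alone: refit on four folds,
# score on the held-out fold, and average the held-out errors.
folds = np.array_split(rng.permutation(n_train), 5)
cv_errors = []
for held_out in folds:
    keep = np.setdiff1d(np.arange(n_train), held_out)
    fold_coef = fit_least_squares(X_train[keep], y_train[keep])
    cv_errors.append(mse(X_train[held_out], y_train[held_out], fold_coef))
cv_mse = float(np.mean(cv_errors))

print(f"training MSE:  {train_mse:.3g}")
print(f"5-fold CV MSE: {cv_mse:.3g}")
print(f"test MSE:      {test_mse:.3g}")
```

In this sketch the training error alone would suggest a near-perfect model. The cross-validated error, computed without touching the test set, is what reveals the problem, which is the role the abstract assigns to reliable model evaluation.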
