Learning on complex, biased, and big data: disease risk prediction in epidemiological studies and genomic medicine on the example of childhood asthma

Predicting the risk of complex diseases is a field of growing relevance in medicine, with high potential for refinement and improvement through the integration of new data types and larger data sets. In this thesis, we investigate and address several challenges in this field by applying and developing statistical methodology for complex data structures, and we show how disease prediction can be improved by accounting for the bias, complexity, and size of the data.
