Jackknife Model Averaging Prediction Methods for Complex Phenotypes with Gene Expression Levels by Integrating External Pathway Information

Motivation In the past few years, many novel prediction approaches have been proposed and widely applied to high-dimensional genetic data for disease risk evaluation. However, these approaches typically ignore, during model fitting, the important group structures or functional classifications that naturally exist in genetic data.
Methods In the present study, we applied a novel model averaging approach, called Jackknife Model Averaging Prediction (JMAP), to high-dimensional genetic risk prediction while incorporating KEGG pathway information into the model specification. JMAP selects the optimal weights across candidate models by minimizing a jackknife (leave-one-out) cross-validation criterion. Unlike previous approaches, JMAP allows each model weight to vary from 0 to 1 without requiring the weights to sum to one. We evaluated the performance of JMAP in extensive simulation studies and compared it with existing methods. Finally, we applied JMAP to five real cancer datasets that are publicly available from TCGA.
Results In the simulations, JMAP performed best or was among the best methods across a range of scenarios. For example, in 14 of the 16 simulation settings with PVE=0.3, the prediction accuracy of JMAP was on average 0.075 higher than that of gsslasso. We further found that, in the simulations, the weights of the true candidate models were much less likely to be zero than those of the null candidate models and were substantially greater in magnitude. In the real data application, JMAP also performed comparably to or better than the other methods for both continuous and binary phenotypes. For example, for the COAD, CRC and PAAD data sets, the average gains in predictive accuracy of JMAP over gsslasso were 0.019, 0.064 and 0.052, respectively.
Conclusion JMAP is a novel method that can provide more accurate phenotypic prediction while incorporating useful external group information.
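
The weighting step described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each candidate model is a no-intercept ridge regression on the genes of one pathway (with the data centered beforehand), and that the weights minimize the squared leave-one-out prediction error subject only to the box constraint 0 <= w_k <= 1. The function names, the ridge penalty `lam`, and the coordinate-descent solver are all hypothetical choices made for illustration.

```python
import numpy as np


def loo_ridge_predictions(X, y, lam=1.0):
    """Leave-one-out predictions of a no-intercept ridge fit.

    Uses the hat-matrix identity for linear smoothers:
    y_i - yhat_(-i) = (y_i - yhat_i) / (1 - h_ii).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    fitted = H @ y
    return y - (y - fitted) / (1.0 - np.diag(H))


def box_constrained_weights(P, y, n_sweeps=200):
    """Coordinate descent for min ||y - P w||^2 with 0 <= w_k <= 1.

    No sum-to-one constraint is imposed (assumed reading of the abstract).
    """
    P = np.asarray(P, dtype=float)
    y = np.asarray(y, dtype=float)
    K = P.shape[1]
    w = np.zeros(K)
    resid = y.copy()                 # current residual y - P @ w
    col_sq = (P ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for k in range(K):
            if col_sq[k] == 0.0:
                continue
            partial = resid + P[:, k] * w[k]      # residual without model k
            w_new = np.clip(P[:, k] @ partial / col_sq[k], 0.0, 1.0)
            resid = partial - P[:, k] * w_new
            w[k] = w_new
    return w


def jackknife_model_average(X, y, groups, lam=1.0):
    """One candidate model per gene group; weights from the jackknife criterion."""
    X = np.asarray(X, dtype=float)
    P = np.column_stack(
        [loo_ridge_predictions(X[:, g], y, lam) for g in groups]
    )
    return box_constrained_weights(P, y)
```

For example, `jackknife_model_average(X, y, groups=[[0, 1, 2], [3, 4, 5]])` returns one weight per group. A prediction for a new sample would then be the weighted sum of the per-group ridge predictions refitted on the full data. Because the weights need not sum to one, that sum is not a convex combination of the candidate predictions.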
