A Two-Stage Mutual Information Based Bayesian Lasso Algorithm for Multi-Locus Genome-Wide Association Studies

Genome-wide association studies (GWAS) have become an essential tool for exploring the genetic mechanisms of complex traits. To reduce computational complexity, it is common practice to remove unrelated single nucleotide polymorphisms (SNPs) before association analysis, for example with the iterative sure independence screening expectation-maximization Bayesian Lasso (ISIS EM-BLASSO) method. In this work, a modified version of ISIS EM-BLASSO is proposed. It first reduces the number of SNPs with a screening procedure based on Pearson correlation and mutual information, then estimates the SNP effects via EM-Bayesian Lasso (EM-BLASSO), and finally detects the true quantitative trait nucleotides (QTNs) through a likelihood ratio test. We call our method the two-stage mutual information based Bayesian Lasso (MBLASSO). Under three simulation scenarios, MBLASSO improves statistical power and retains higher effect-estimation accuracy compared with three other algorithms. Moreover, MBLASSO performs best on model fitting and achieves the highest accuracy of detected associations, and 21 genes are detected only by MBLASSO in the Arabidopsis thaliana datasets.
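The screening step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the quantile binning of the phenotype, the plug-in mutual information estimator, and the `keep_corr`, `keep_mi`, and `n_bins` parameters are all assumptions made for illustration. It only shows one plausible way to rank SNPs first by Pearson correlation and then by mutual information; the EM-BLASSO estimation and likelihood ratio test are not covered.

```python
import numpy as np


def pearson_scores(X, y):
    """Absolute Pearson correlation between each SNP column of X and phenotype y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant (monomorphic) SNPs get a score of 0
    return np.abs(Xc.T @ yc) / denom


def mutual_information(x, y_binned):
    """Plug-in mutual information (in nats) between two discrete vectors."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y_binned, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)          # contingency table of counts
    joint /= joint.sum()                   # joint probabilities
    px = joint.sum(axis=1, keepdims=True)  # marginal of x
    py = joint.sum(axis=0, keepdims=True)  # marginal of y
    nz = joint > 0                         # skip empty cells (0 * log 0 = 0)
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())


def two_stage_screen(X, y, keep_corr, keep_mi, n_bins=5):
    """Hypothetical two-stage screen: Pearson correlation, then mutual information.

    X is an (individuals x SNPs) genotype matrix coded 0/1/2 and y is a
    continuous phenotype. Returns the sorted column indices of retained SNPs.
    """
    # Stage 1: keep the keep_corr SNPs with the largest |Pearson correlation|.
    corr = pearson_scores(X, y)
    stage1 = np.argsort(corr)[::-1][:keep_corr]

    # Stage 2: bin the phenotype by quantiles (an assumption of this sketch),
    # then keep the keep_mi stage-1 SNPs with the largest mutual information.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    y_binned = np.digitize(y, edges)
    mi = np.array([mutual_information(X[:, j], y_binned) for j in stage1])
    stage2 = stage1[np.argsort(mi)[::-1][:keep_mi]]
    return np.sort(stage2)
```

On a simulated genotype matrix with a few strong-effect SNPs, calling `two_stage_screen(X, y, keep_corr=40, keep_mi=10)` would be expected to retain those SNPs among the 10 survivors. The retained set would then be passed to an effect-estimation step such as EM-BLASSO.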
