A modification of the Lasso method by using the Bahadur representation for the genome-wide association study

This paper proposes a modification of the Lasso method, a powerful machine learning tool, for genome-wide association studies. From the machine learning point of view, the paper solves a feature selection problem in which the features are single nucleotide polymorphisms (SNPs), or DNA markers, whose association with a quantitative trait is to be established. The main idea underlying the modification is to take into account correlations between DNA markers and peculiarities of the phenotype values by using the Bahadur representation of the joint probabilities of binary random variables. Interactions between DNA markers, known as epistasis, are also considered within the framework of the proposed modification. Numerical experiments with real datasets illustrate the proposed modification.
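As a rough illustration of the Bahadur representation mentioned above, the following sketch computes its second-order truncation for a set of binary markers: the product of the marginal Bernoulli probabilities multiplied by a correction term built from pairwise correlations of the standardized markers. This is not the paper's modified Lasso. The function name `bahadur_second_order` and the data layout (rows are individuals, columns are 0/1 markers) are assumptions made for the example, and every marker is assumed to take both values in the sample.

```python
import numpy as np

def bahadur_second_order(X, x):
    """Second-order Bahadur approximation of P(X = x) for binary markers.

    X: (n_samples, n_markers) array of 0/1 values (hypothetical layout).
    x: 0/1 vector of length n_markers whose joint probability is wanted.
    Assumes every column of X contains both 0 and 1.
    """
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    p = X.mean(axis=0)                    # marginal P(X_i = 1)
    s = np.sqrt(p * (1.0 - p))            # Bernoulli standard deviations
    Z = (X - p) / s                       # standardized sample markers
    rho = (Z.T @ Z) / X.shape[0]          # rho_ij = E[Z_i Z_j]
    z = (x - p) / s                       # standardized query point
    # Probability of x if the markers were independent.
    independent = np.prod(p ** x * (1.0 - p) ** (1.0 - x))
    # Pairwise correction: 1 + sum over i < j of rho_ij * z_i * z_j.
    i, j = np.triu_indices(len(p), k=1)
    correction = 1.0 + np.sum(rho[i, j] * z[i] * z[j])
    return independent * correction
```

For two markers the second-order expansion is exact, so the function reproduces the empirical joint frequencies. With more markers it is only an approximation, because higher-order correlation terms are dropped, and the result is not guaranteed to be a valid probability.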
