Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations

Background: Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that analyze single variants one at a time, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants both individually and in conjunction with one another. Constructing models that account for such subsets of variants requires methodologies that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the excessive number of variants, constructing these types of models has so far been computationally infeasible.

Results: We have implemented an algorithm, known as greedy RLS, that we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space-efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. As a proof of concept, we apply greedy RLS to the Hypertension – UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop. On this dataset, we also show that greedy RLS has a better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS.

Conclusions: Greedy RLS is the first known implementation of a machine-learning-based method with the capability to conduct a wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/.
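To make the idea of wrapper-based greedy forward selection with regularized least-squares (RLS) concrete, the following is a minimal, naive sketch in Python with NumPy. It is not the linear-time greedy RLS algorithm and it does not use the RLScore API: every candidate feature is evaluated by refitting a small ridge model. Using leave-one-out squared error as the selection criterion is an assumption made here, since the abstract does not name the criterion, and the function names `loo_squared_error` and `greedy_forward_selection` are hypothetical. The sketch does show one standard matrix short-cut of the kind the abstract alludes to: for ridge regression without an intercept, the leave-one-out residuals follow in closed form from the diagonal of the hat matrix, so no model is retrained per held-out example.

```python
import numpy as np


def loo_squared_error(X, y, lam):
    """Sum of squared leave-one-out residuals for ridge regression on X.

    Uses the closed-form identity r_i / (1 - h_ii), where h_ii is the i-th
    diagonal entry of the hat matrix X (X^T X + lam I)^{-1} X^T.
    """
    k = X.shape[1]
    # G = (X^T X + lam I)^{-1} X^T, computed with a linear solve.
    G = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T)
    h_diag = np.einsum("ij,ji->i", X, G)  # diagonal of X @ G
    residuals = y - X @ (G @ y)           # ordinary training residuals
    loo_residuals = residuals / (1.0 - h_diag)
    return float(np.sum(loo_residuals ** 2))


def greedy_forward_selection(X, y, n_select, lam=1.0):
    """Naive greedy forward selection (illustrative, not the paper's algorithm).

    At each step, add the feature whose inclusion gives the lowest
    leave-one-out squared error of the ridge model on the selected set.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best_j, best_err = None, np.inf
        for j in remaining:
            err = loo_squared_error(X[:, selected + [j]], y, lam)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Calling `greedy_forward_selection(X, y, n_select=10)` on an examples-by-variants matrix `X` and a label vector `y` returns the indices of the chosen columns in selection order. Because this naive version refits a model for every candidate at every step, its cost grows much faster than linearly in the number of selected features, which is the overhead the greedy RLS short-cuts are designed to avoid.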
