Sparse Coding for Feature Selection on Genome-Wide Association Data

Genome-wide association (GWA) studies produce large amounts of high-dimensional data and aim to identify variables that increase the risk for a given phenotype. Univariate examinations have provided some insights, but most diseases appear to be affected by interactions of multiple factors, which can only be identified through multivariate analysis. Multivariate analysis of discrete, high-dimensional, low-sample-size GWA data is, however, made more difficult by random effects and nonspecific coupling. In this work, we investigate the suitability of three standard techniques (p-values, support vector machines (SVM), and principal component analysis (PCA)) for analyzing GWA data on several simulated datasets. We compare these standard techniques against a sparse coding approach and demonstrate that sparse coding clearly outperforms them and can identify interacting factors in far higher-dimensional datasets than the other three approaches.
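The point that interacting factors can escape univariate examination can be illustrated with a minimal toy sketch. It is not the simulation used in this work: the two binary loci, the purely interactive (XOR) phenotype, the sample size, and all names (`snp_a`, `snp_b`, `phenotype`, `pearson`) are assumptions chosen only for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
n_samples = 4000         # assumed sample size, for illustration only

# Hypothetical binary carrier status at two independent loci.
snp_a = [rng.randint(0, 1) for _ in range(n_samples)]
snp_b = [rng.randint(0, 1) for _ in range(n_samples)]

# Toy phenotype driven purely by the interaction (XOR) of the two loci,
# so that neither locus has a marginal effect on its own.
phenotype = [a ^ b for a, b in zip(snp_a, snp_b)]

# Univariate view: each locus alone is almost uncorrelated with the phenotype.
marginal_a = pearson(snp_a, phenotype)
marginal_b = pearson(snp_b, phenotype)

# Multivariate view: the combined feature explains the phenotype completely.
interaction = [a ^ b for a, b in zip(snp_a, snp_b)]
joint = pearson(interaction, phenotype)

print(f"marginal A: {marginal_a:+.3f}")
print(f"marginal B: {marginal_b:+.3f}")
print(f"joint A xor B: {joint:+.3f}")
```

In this contrived setting a per-variable test sees almost no signal at either locus, while a method that considers the two variables together recovers the association. Real GWA data are far noisier and much higher-dimensional, which is the regime the abstract addresses.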
