Principal variable approach to multipurpose SNP selection in genetic association studies

Despite the various merits of joint analysis of multiple markers, single-marker analysis is still widely adopted in genome-wide association studies (GWAS). Because GWAS data tend to contain many near-duplicate SNPs in linkage disequilibrium, it is challenging to eliminate the redundant SNPs and determine the subset of informative SNPs to include in a joint analysis. In this study, we propose an unsupervised SNP selection algorithm based on the principal variable approach, called the multipurpose SNP selection (MP-SNP) method. The MP-SNP method selects a subset of the original variables that preserves the structure and information of the full set, and the resulting SNP subset can be used for further analysis in various ways. Based on our simulation and real data analysis, we conclude that the MP-SNP method performs well in selecting informative SNPs and also yields well-explained cluster structures.
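To make the principal variable idea concrete, the following is a minimal sketch of a generic greedy principal-variable selection, not the MP-SNP algorithm itself, whose details are not given here. It assumes genotypes coded as 0/1/2 in a samples-by-SNPs matrix with no monomorphic SNPs; the function name `select_principal_variables` and the stopping parameter `var_explained` are hypothetical and introduced only for illustration.

```python
import numpy as np

def select_principal_variables(X, var_explained=0.9, tol=1e-10):
    """Greedy principal-variable selection (illustrative sketch only).

    X: (n_samples, n_snps) genotype matrix; every column is assumed to vary.
    Returns the column indices of the selected SNPs, in selection order.
    """
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)      # SNP-by-SNP correlation matrix
    total = np.trace(R)                   # total standardized variance
    residual = R.copy()                   # variation not yet explained
    selected = []
    while len(selected) < R.shape[0]:
        diag = np.diag(residual)
        # Variance of all SNPs explained by regressing them on candidate j:
        # sum_i residual[i, j]^2 / residual[j, j].
        scores = np.where(
            diag > tol,
            (residual ** 2).sum(axis=0) / np.maximum(diag, tol),
            -np.inf,                      # already explained or selected
        )
        j = int(np.argmax(scores))
        if not np.isfinite(scores[j]):
            break                         # nothing informative is left
        selected.append(j)
        # Partial out SNP j from the remaining covariance structure.
        col = residual[:, j].copy()
        residual = residual - np.outer(col, col) / col[j]
        if 1.0 - np.trace(residual) / total >= var_explained:
            break
    return selected

# Toy usage: three independent SNPs plus exact duplicates of the first two.
rng = np.random.default_rng(0)
base = rng.integers(0, 3, size=(200, 3))
genotypes = np.column_stack([base, base[:, 0], base[:, 1]])
chosen = select_principal_variables(genotypes, var_explained=0.999)
```

In this toy example the duplicated columns carry no additional information, so the sketch should retain one representative from each duplicated pair together with the third SNP. Stopping on a fixed fraction of explained variance is one simple choice; other stopping rules are possible.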
