Multi-purpose SNP Selection by the principal variables for a genetic study

In genome-wide association studies, the number of single nucleotide polymorphisms (SNPs) has increased drastically. The data may contain many near-duplicate SNPs in linkage disequilibrium, which complicates analysis and causes statistical problems in downstream work. Principal component analysis (PCA) is a popular dimension reduction technique and is well known to be effective in many genetic association analyses. However, each principal component (PC) is a linear combination of all the original variables, so it offers no direct interpretation in terms of the original variables. The purpose of our study is to eliminate redundant SNPs and select a smaller subset consisting only of informative SNPs. We propose an unsupervised SNP selection algorithm based on the principal variable (PV) method, which achieves dimensionality reduction by selecting a subset of the original variables, called PVs, that preserves as much information as possible. To find an optimal subset of SNPs, we focus on the criterion that minimizes the squared norm of the partial covariance matrix. We define a PC cluster by PCA and choose as its representative the SNP that, on average, has high loadings on the important PCs. After discarding the other SNPs in the PC cluster, we calculate the partial covariance matrix of the remaining variables given the selected PV. To obtain the next representative SNP, the same procedure is applied to the partial covariance matrix. The process repeats until no variables remain to select or a stopping criterion is met, namely the percentage of variance explained in terms of the trace or the squared norm of the covariance matrix. The resulting subset of SNPs can be used in further analyses with multiple purposes, such as the study of gene-gene interactions. We illustrate the proposed method on real genotype data and compare its performance with five existing selection methods for principal variables.
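The iteration on the partial covariance matrix can be illustrated with a minimal Python sketch. It is not the authors' implementation: it omits the PC-cluster step and instead greedily picks, at each round, the variable whose removal leaves the smallest squared Frobenius norm of the partial covariance matrix, stopping once a chosen fraction of the total variance (by trace) is explained. The function names, the `var_explained` threshold, and the tolerance `tol` are assumptions made for illustration.

```python
import numpy as np


def partial_covariance(S, j):
    """Covariance of all variables after regressing out variable j.

    Row and column j of the result are zero (up to rounding).
    """
    s_j = S[:, [j]]
    return S - (s_j @ s_j.T) / S[j, j]


def select_principal_variables(X, var_explained=0.9, tol=1e-10):
    """Greedy principal-variable selection (illustrative sketch only).

    X is an (n_samples, n_variables) array, e.g. genotypes coded 0/1/2.
    Returns the column indices of the selected variables, in order.
    """
    S = np.cov(X, rowvar=False)
    total = np.trace(S)
    remaining = list(range(S.shape[0]))
    selected = []
    # Stop when the residual variance falls to (1 - var_explained) of the total.
    while remaining and np.trace(S) / total > 1.0 - var_explained:
        # Skip variables whose residual variance is numerically zero.
        candidates = [j for j in remaining if S[j, j] > tol]
        if not candidates:
            break
        # Criterion: minimise the squared norm of the partial covariance matrix.
        best = min(
            candidates,
            key=lambda j: np.linalg.norm(partial_covariance(S, j), "fro") ** 2,
        )
        S = partial_covariance(S, best)
        selected.append(best)
        remaining.remove(best)
    return selected
```

On data in which several columns are near-duplicates of one another, a sketch like this would be expected to keep one column from each duplicated group and leave the rest unselected, since a near-duplicate has almost no residual variance once its partner has been partialled out.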
