Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, chiefly to detect population structure and potential outliers. However, SNP datasets have grown immensely in recent years, and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which extracts the top principal components with accuracy identical to that of existing tools in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals completed in 4 hours. As SNP datasets continue to grow, tools such as flashpca will become essential, because traditional approaches will not scale adequately. This approach will also help to scale other applications that rely on PCA or eigen-decomposition to substantially larger datasets.
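To make the idea of PCA "based on randomized algorithms" concrete, the following is a minimal NumPy sketch of a generic randomized PCA: project the data onto a small random subspace, refine that subspace with a few power iterations, and run an exact SVD only on the resulting small matrix. It is not flashpca's implementation; the function name `randomized_pca`, its parameters, and the choice to center (rather than fully standardize) each SNP column are assumptions made for illustration.

```python
import numpy as np

def randomized_pca(X, n_components=10, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top principal components of X (samples x SNPs).

    Illustrative sketch only: names and defaults are hypothetical.
    """
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                      # center each SNP column
    n, p = X.shape
    k = min(n_components + n_oversamples, n, p)  # working subspace size

    # Random projection captures most of the column space of X.
    Q, _ = np.linalg.qr(X @ rng.standard_normal((p, k)))

    # Power iterations sharpen the subspace; QR keeps it well conditioned.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Z)

    # Exact SVD of the small k x p matrix instead of the full n x p one.
    B = Q.T @ X
    U_small, s, _ = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small

    scores = U[:, :n_components] * s[:n_components]   # projected samples
    eigvals = s[:n_components] ** 2 / (n - 1)         # variance per component
    return scores, eigvals
```

The cost is dominated by a handful of matrix products with `X`, each involving only `k` columns, rather than a full eigen-decomposition. That is the general reason this family of methods scales to large sample sizes when only the leading components are needed.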
