Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure

SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization that displays and reveals patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. PCA is not a single method that is always done the same way; rather, it requires three choices, which we explore as a three-way factorial: two kinds of PCA graphs by three SNP codings by six PCA variants. Our three main recommendations are simple and easily implemented: use PCA biplots, code SNPs as 1 for the rare allele and 0 for the common allele, and use double-centered PCA (or AMMI1 if main effects are also of interest). We also document contemporary practices with a literature survey of 125 representative articles that apply PCA to SNP data, and find that virtually none implement our recommendations. The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be the discovery of more biology, and thereby the acceleration of medical, agricultural, and other vital applications.
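As a rough illustration of the recommended coding and PCA variant, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a binary SNPs-by-individuals matrix with no heterozygotes or missing calls. The function names and the symmetric scaling (square root of each singular value applied to both sets of scores) are illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def code_rare_allele_as_one(genotypes):
    """Recode a 0/1 SNP matrix (SNPs x individuals) so that 1 marks the rare allele.

    Assumption: the input is strictly binary; heterozygotes and missing
    data are not handled in this sketch.
    """
    g = np.array(genotypes, dtype=float)  # np.array copies, so the input is untouched
    flip = g.mean(axis=1) > 0.5           # rows in which 1 is currently the common allele
    g[flip] = 1.0 - g[flip]
    return g


def double_centered_pca(matrix, n_components=2):
    """PCA of a matrix after removing both row and column means.

    Returns row (SNP) scores, column (individual) scores, and the share of
    the residual sum of squares captured by each retained component.
    Assumes the double-centered matrix is not identically zero.
    """
    m = np.asarray(matrix, dtype=float)
    row_means = m.mean(axis=1, keepdims=True)
    col_means = m.mean(axis=0, keepdims=True)
    residual = m - row_means - col_means + m.mean()  # double centering

    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    k = n_components
    # Symmetric scaling (an assumption): both sets of scores share the same
    # units, so they can be drawn together in one biplot.
    row_scores = u[:, :k] * np.sqrt(s[:k])
    col_scores = vt[:k].T * np.sqrt(s[:k])
    explained = (s ** 2) / np.sum(s ** 2)
    return row_scores, col_scores, explained[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = (rng.random((30, 8)) < 0.3).astype(int)  # synthetic data, illustration only
    coded = code_rare_allele_as_one(raw)
    snp_scores, ind_scores, explained = double_centered_pca(coded)
    print(snp_scores.shape, ind_scores.shape, explained)
```

Plotting `snp_scores` and `ind_scores` on the same pair of axes, with equal axis scales, would give a biplot in the sense recommended above.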
