locStra: Fast analysis of regional/global stratification in whole‐genome sequencing studies

locStra is an R package for the analysis of regional and global population stratification in whole-genome sequencing (WGS) studies, where regional stratification refers to the substructure defined by the loci in a particular region of the genome. Population substructure can be assessed with the genetic covariance matrix, the genomic relationship matrix, or the unweighted/weighted genetic Jaccard similarity matrix. Using a sliding-window approach, the regional similarity matrices are compared with the global ones, based on user-defined window sizes and metrics, for example the correlation between regional and global eigenvectors. An algorithm for choosing the window size is provided. Because the implementation is written in C++ and fully exploits sparse matrix algebra, the analysis is highly efficient. Even on a single core, for realistic study sizes (several thousand subjects, several million rare variants per subject), the runtime for the genome-wide computation of all regional similarity matrices typically does not exceed one hour, enabling an unprecedented investigation of regional stratification across the entire genome. The package is applied to three WGS studies, illustrating the varying patterns of regional substructure across the genome and its beneficial effects on association testing.
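
The sliding-window comparison described above can be illustrated with a minimal sketch. It is written in plain Python for readability and is not the package's interface or its C++ implementation. The function names (`jaccard_matrix`, `leading_eigenvector`, `pearson`, `window_scores`), the use of non-overlapping windows, the dense 0/1 input, and power iteration as the eigenvector routine are all assumptions made for illustration. Only the unweighted Jaccard matrix is covered.

```python
import math


def jaccard_matrix(geno):
    """Unweighted Jaccard similarity between subjects.

    geno: list of rows (one per subject); each row holds 0/1 indicators
    of whether the subject carries the variant at that locus.
    """
    carried = [{j for j, g in enumerate(row) if g} for row in geno]
    n = len(carried)
    sim = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            union = len(carried[a] | carried[b])
            # Convention assumed here: similarity is 0 when neither subject
            # carries any variant in the region.
            s = len(carried[a] & carried[b]) / union if union else 0.0
            sim[a][b] = sim[b][a] = s
    return sim


def leading_eigenvector(mat, iters=200):
    """Dominant eigenvector of a symmetric non-negative matrix (power iteration)."""
    n = len(mat)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(m * x for m, x in zip(row, v)) for row in mat]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all-zero matrix: no direction to converge to
            return v
        v = [x / norm for x in w]
    return v


def pearson(x, y):
    """Pearson correlation; NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def window_scores(geno, window):
    """Correlate each regional leading eigenvector with the global one."""
    n_var = len(geno[0])
    global_vec = leading_eigenvector(jaccard_matrix(geno))
    scores = []
    for start in range(0, n_var - window + 1, window):
        region = [row[start:start + window] for row in geno]
        local_vec = leading_eigenvector(jaccard_matrix(region))
        scores.append(pearson(global_vec, local_vec))
    return scores


if __name__ == "__main__":
    # Toy data: 4 subjects, 8 loci, scanned in two windows of 4 loci each.
    toy = [
        [1, 1, 0, 0, 1, 0, 1, 0],
        [1, 0, 0, 0, 0, 1, 1, 0],
        [0, 0, 1, 1, 1, 0, 0, 1],
        [0, 1, 1, 0, 0, 1, 0, 1],
    ]
    print(window_scores(toy, window=4))
```

A score near 1 would indicate that the leading axis of regional substructure matches the global one, while lower scores flag windows whose local structure departs from it. This dense version is quadratic in the number of subjects for every window and is only meant to show the logic; the efficiency reported above relies on sparse matrix algebra in C++.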
