PCAmatchR: a flexible R package for optimal case-control matching using weighted principal components

SUMMARY A concern when conducting genome-wide association studies (GWAS) is population stratification, i.e. ancestry-based genetic differences between cases and controls that, if not properly accounted for, could bias association results. We developed PCAmatchR as an open-source R package for optimal case-control matching using principal component analysis (PCA), to aid in selecting controls that are well matched to cases by ancestry. PCAmatchR takes user-supplied PCA outputs and selects matching controls for cases using a weighted Mahalanobis distance metric that weights each principal component by the percentage of genetic variation it explains. Results from 1000 Genomes Project data demonstrate both the functionality and the performance of PCAmatchR in selecting matching controls for case populations and in reducing inflation of association test statistics. PCAmatchR improves genomic similarity between matched cases and controls, which minimizes the effects of population stratification in GWAS analyses.

AVAILABILITY PCAmatchR is freely available for download on GitHub (https://github.com/machiela-lab/PCAmatchR) or through CRAN (https://cran.r-project.org/web/packages/PCAmatchR/index.html).

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
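To make the weighted distance described in the summary concrete, the following is a minimal Python sketch and not PCAmatchR's R interface or implementation. It rests on several assumptions not stated in the abstract: principal component scores are treated as uncorrelated, so the covariance matrix is diagonal; weights are the percentages of variation explained, normalized to sum to one; and matching is 1:1, found by brute force over a tiny example. The function names `weighted_mahalanobis` and `optimal_match` and all sample values are hypothetical.

```python
from itertools import permutations
from math import sqrt
from statistics import variance


def weighted_mahalanobis(x, y, variances, weights):
    """Weighted Mahalanobis distance between two PC score vectors.

    Assumes the PCs are uncorrelated (diagonal covariance), so each squared
    difference is scaled by 1/variance and then by the weight of that PC.
    """
    return sqrt(sum(w * (a - b) ** 2 / v
                    for a, b, v, w in zip(x, y, variances, weights)))


def optimal_match(cases, controls, pct_explained):
    """Return the 1:1 case-to-control assignment with minimum total distance.

    Brute force over all assignments: an illustration for tiny inputs only.
    """
    # Normalize the percent of variation explained into weights summing to 1.
    total = sum(pct_explained)
    weights = [p / total for p in pct_explained]

    # Per-PC sample variance estimated from the pooled cases and controls.
    pooled = list(cases.values()) + list(controls.values())
    variances = [variance(column) for column in zip(*pooled)]

    case_ids = list(cases)
    best_cost, best_assignment = float("inf"), None
    for chosen in permutations(controls, len(case_ids)):
        cost = sum(
            weighted_mahalanobis(cases[c], controls[k], variances, weights)
            for c, k in zip(case_ids, chosen)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, dict(zip(case_ids, chosen))
    return best_assignment, best_cost


# Hypothetical scores on two PCs (PC1, PC2) for two cases and three controls.
cases = {"case1": (0.0, 0.0), "case2": (1.0, 1.0)}
controls = {"ctrlA": (0.1, 0.0), "ctrlB": (0.9, 1.2), "ctrlC": (5.0, -3.0)}
pct_explained = [60.0, 10.0]  # assumed percent of variation explained per PC

matches, cost = optimal_match(cases, controls, pct_explained)
print(matches)
```

Because PC1 carries the larger weight in this toy example, differences along PC1 contribute more to the distance than equally sized standardized differences along PC2. For realistic sample sizes, the brute-force search would be replaced by a dedicated assignment or network-flow solver.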
