Using BioBin to Explore Rare Variant Population Stratification

Rare variants (RVs) will likely explain additional heritability of many common complex diseases; however, the natural frequencies of rare variation across and between human populations are largely unknown. We have developed a powerful, flexible collapsing method called BioBin that utilizes prior biological knowledge using multiple publicly available database sources to direct analyses. Variants can be collapsed according to functional regions, evolutionary conserved regions, regulatory regions, genes, and/or pathways without the need for external files. We conducted an extensive comparison of rare variant burden differences (MAF < 0.03) between two ancestry groups from 1000 Genomes Project data, Yoruba (YRI) and European descent (CEU) individuals. We found that 56.86% of gene bins, 72.73% of intergenic bins, 69.45% of pathway bins, 32.36% of ORegAnno annotated bins, and 9.10% of evolutionary conserved regions (shared with primates) have statistically significant differences in RV burden. Ongoing efforts include examining additional regional characteristics using regulatory regions and protein binding domains. Our results show interesting variant differences between two ancestral populations and demonstrate that population stratification is a pervasive concern for sequence analyses.

[1]  Jacob A. Tennessen,et al.  Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes , 2012, Science.

[2]  Wei Pan,et al.  A Data-Adaptive Sum Test for Disease Association with Multiple Common or Rare Variants , 2010, Human Heredity.

[3]  Gary D Bader,et al.  NetPath: a public resource of curated signal transduction pathways , 2010, Genome Biology.

[4]  Emily H Turner,et al.  Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome , 2010, Nature Genetics.

[5]  Hong-Wen Deng,et al.  Efficient Utilization of Rare Variants for Detection of Disease-Related Genomic Regions , 2010, PloS one.

[6]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[7]  Iuliana Ionita-Laza,et al.  A New Testing Strategy to Identify Rare Variants with Either Risk or Protective Effect on Disease , 2011, PLoS genetics.

[8]  David Haussler,et al.  The UCSC genome browser database: update 2007 , 2006, Nucleic Acids Res..

[9]  Lincoln Stein,et al.  Reactome: a database of reactions, pathways and biological processes , 2010, Nucleic Acids Res..

[10]  Lili Ding,et al.  Effect of population stratification analysis on false-positive rates for common and rare variants , 2011, BMC proceedings.

[11]  Alkes L. Price,et al.  New approaches to population stratification in genome-wide association studies , 2010, Nature Reviews Genetics.

[12]  Claudio J. Verzilli,et al.  An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People , 2012, Science.

[13]  Pedro M. Valero-Mora,et al.  ggplot2: Elegant Graphics for Data Analysis , 2010 .

[14]  S. Browning,et al.  A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic , 2009, PLoS genetics.

[15]  E. Birney,et al.  Pfam: the protein families database , 2013, Nucleic Acids Res..

[16]  Sivakumar Gowrisankar,et al.  A rare penetrant mutation in CFH confers high risk of age-related macular degeneration , 2011, Nature Genetics.

[17]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology Information: update , 2004, Nucleic acids research.

[18]  Shamil R Sunyaev,et al.  Pooled association tests for rare variants in exon-resequencing studies. , 2010, American journal of human genetics.

[19]  V. Salomaa,et al.  Excess of rare variants in genes identified by genome-wide association study of hypertriglyceridemia , 2010, Nature Genetics.

[20]  Hongzhe Li,et al.  A powerful test for multiple rare variants association studies that incorporates sequencing qualities , 2012, Nucleic acids research.

[21]  S. Leal,et al.  Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. , 2008, American journal of human genetics.

[22]  M. G. Reese,et al.  A probabilistic disease-gene finder for personal genomes. , 2011, Genome research.

[23]  Hadley Wickham,et al.  ggplot2 - Elegant Graphics for Data Analysis (2nd Edition) , 2017 .

[24]  Livia Perfetto,et al.  MINT, the molecular interaction database: 2009 update , 2009, Nucleic Acids Res..

[25]  Michelle Whirl-Carrillo,et al.  From pharmacogenomic knowledge acquisition to clinical applications: the PharmGKB as a clinical pharmacogenomic biomarker resource. , 2011, Biomarkers in medicine.

[26]  Donald W. Bowden,et al.  Candidate genes for non-diabetic ESRD in African Americans: a genome-wide association study using pooled DNA , 2010, Human Genetics.

[27]  Marek Kimmel,et al.  Forward-Time Simulations of Human Populations with Complex Diseases , 2007, PLoS genetics.

[28]  W. Thilly,et al.  A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). , 2007, Mutation research.

[29]  Susumu Goto,et al.  KEGG for integration and interpretation of large-scale molecular data sets , 2011, Nucleic Acids Res..

[30]  Alan Hodgkinson,et al.  Variation in the mutation rate across mammalian genomes , 2011, Nature Reviews Genetics.

[31]  Mary Goldman,et al.  The UCSC Genome Browser database: update 2011 , 2010, Nucleic Acids Res..

[32]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[33]  H. Ellegren,et al.  Mutation rate variation in the mammalian genome. , 2003, Current opinion in genetics & development.

[34]  Xihong Lin,et al.  Rare-variant association testing for sequencing data with the sequence kernel association test. , 2011, American journal of human genetics.

[35]  Kara Dolinski,et al.  The BioGRID Interaction Database: 2011 update , 2010, Nucleic Acids Res..

[36]  Christie S. Chang,et al.  The BioGRID interaction database: 2013 update , 2012, Nucleic Acids Res..

[37]  Rachael P. Huntley,et al.  The UniProt-GO Annotation database in 2011 , 2011, Nucleic Acids Res..

[38]  Gaurav Bhatia,et al.  A Covering Method for Detecting Genetic Associations between Rare Variants and Common Phenotypes , 2010, PLoS Comput. Biol..

[39]  Thomas Meitinger,et al.  Exome sequencing identifies ACAD9 mutations as a cause of complex I deficiency , 2010, Nature Genetics.

[40]  John S. Witte,et al.  Comprehensive Approach to Analyzing Rare Genetic Variants , 2010, PloS one.

[41]  Obi L. Griffith,et al.  ORegAnno: an open-access community-driven resource for regulatory annotation , 2007, Nucleic Acids Res..