On the cross-population generalizability of gene expression prediction models

The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.

Author summary

Advances in RNA sequencing technology have reduced the cost of measuring gene expression at a genome-wide level. However, sequencing enough human RNA samples for adequately powered disease association studies remains prohibitively costly. To work around this cost, modern transcriptome-wide association analysis tools leverage existing paired genotype-expression datasets by creating models that predict gene expression from genotypes. These predictive models enable researchers to perform cost-effective association tests with gene expression in independently genotyped samples. However, most of these models use European reference data, and the extent to which gene expression prediction models work across populations is not fully resolved. We observe that models derived from European-descent individuals predict gene expression worse than expected in a dataset of African Americans. Using simulations, we show that the performance of gene expression prediction models depends on both the number of shared genotype predictors and the genetic relatedness between populations. Our findings suggest a need to carefully select reference populations for prediction and point to a pressing need for more genetically diverse genotype-expression datasets.
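The dependence on shared predictors described above can be illustrated with a toy simulation. The following is a minimal sketch, not the study's simulation framework: it assumes independent SNPs (no linkage disequilibrium), swaps in closed-form ridge regression for the penalized models used by transcriptome prediction tools, and uses hypothetical function names and parameter values (`simulate_genotypes`, `simulate_expression`, `fit_ridge`, `prediction_r2`, a heritability of 0.5) chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_genotypes(rng, n, maf):
    """Draw diploid genotypes (0/1/2) independently per SNP (assumes no LD)."""
    return rng.binomial(2, maf, size=(n, maf.size)).astype(float)

def simulate_expression(rng, genotypes, effects, h2):
    """Expression = genetic component + noise scaled so heritability is about h2."""
    genetic = genotypes @ effects
    noise_var = genetic.var() * (1.0 - h2) / h2
    return genetic + rng.normal(0.0, np.sqrt(noise_var), size=genotypes.shape[0])

def fit_ridge(genotypes, expression, lam=1.0):
    """Closed-form ridge regression on centered data (stand-in for elastic net)."""
    gc = genotypes - genotypes.mean(axis=0)
    yc = expression - expression.mean()
    return np.linalg.solve(gc.T @ gc + lam * np.eye(gc.shape[1]), gc.T @ yc)

def prediction_r2(genotypes, expression, weights):
    """Squared Pearson correlation between predicted and observed expression."""
    return np.corrcoef(genotypes @ weights, expression)[0, 1] ** 2

# Hypothetical settings: 50 cis-SNPs, 5 causal eQTLs, 500 samples per population.
n_snps, n_causal, n_samples, h2 = 50, 5, 500, 0.5

# Two populations with correlated but different allele frequencies.
maf_train = rng.uniform(0.1, 0.5, size=n_snps)
maf_test = np.clip(maf_train + rng.normal(0.0, 0.1, size=n_snps), 0.05, 0.5)

effect_sizes = rng.normal(0.0, 1.0, size=n_causal)
effects_train = np.zeros(n_snps)
effects_train[:n_causal] = effect_sizes

# Scenario 1: the test population shares the same eQTLs and effects.
effects_shared = effects_train.copy()
# Scenario 2: the test population's eQTLs sit at entirely different SNPs.
effects_disjoint = np.zeros(n_snps)
effects_disjoint[n_causal:2 * n_causal] = effect_sizes

# Train the prediction model in the reference population only.
g_train = simulate_genotypes(rng, n_samples, maf_train)
y_train = simulate_expression(rng, g_train, effects_train, h2)
weights = fit_ridge(g_train, y_train)

# Evaluate the same weights in the second population under each scenario.
g_test = simulate_genotypes(rng, n_samples, maf_test)
r2_shared = prediction_r2(
    g_test, simulate_expression(rng, g_test, effects_shared, h2), weights)
r2_disjoint = prediction_r2(
    g_test, simulate_expression(rng, g_test, effects_disjoint, h2), weights)

print(f"R2 with shared eQTLs:   {r2_shared:.3f}")
print(f"R2 with disjoint eQTLs: {r2_disjoint:.3f}")
```

Under these assumptions, weights trained in the reference population transfer only when the causal variants coincide. Intermediate cases, such as partial eQTL sharing or populations with different linkage disequilibrium patterns, would require a more realistic genotype simulator than this sketch provides.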
