Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis

Background: Current reference genome assemblies still contain gaps, represented by stretches of Ns. Because high-throughput sequencing reads cannot be mapped to these gaps, the gap regions are depleted of experimental data. Moreover, several technology platforms assay only a targeted portion of the genomic sequence, so regions in the unassayed portion cannot be detected in those experiments. We refer to all such regions as inaccessible regions, and hypothesize that ignoring them in the null model may increase false findings in statistical testing of colocalization of genomic features.

Results: Our exploratory analyses confirm that the genomic regions in public genomic tracks intersect very little with the assembly gaps of the human reference genomes (hg19 and hg38). What little intersection exists was observed only at the beginning and end portions of the gap regions. Further, we simulated a set of synthetic tracks that matched the properties of real genomic tracks while nullifying any true association between them. This allowed us to test our hypothesis that failing to avoid inaccessible regions (as represented by assembly gaps) in the null model spuriously inflates statistical significance. We contrasted the distributions of test statistics and p-values from Monte Carlo-based permutation tests that either avoided or did not avoid assembly gaps in the null model when testing colocalization between a pair of tracks. Tests that did not account for assembly gaps in the null model produced a distribution of the test statistic shifted to the right and a distribution of p-values shifted to the left (indicating inflated significance). We observed a similar level of inflated significance in hg19 and hg38, despite assembly gaps covering a smaller proportion of the latter reference genome.

Conclusion: We provide empirical evidence that inaccessible regions, even when covering only a few percent of the genome, can lead to a substantial number of false findings if not accounted for in statistical colocalization analysis.
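The contrast between the two null models can be illustrated with a small sketch. This is not the authors' implementation: the toy single-chromosome genome, the representation of one track as points and the other as intervals, the hit-count test statistic, and all names (`make_sampler`, `count_hits`, `permutation_test`, the example coordinates) are assumptions chosen only to show the mechanism.

```python
import random
from bisect import bisect_right

def make_sampler(genome_length, gaps, avoid_gaps, rng):
    """Return a function that draws one random position under the null model."""
    if not avoid_gaps:
        # Null model that ignores gaps: any position on the genome is allowed.
        return lambda: rng.randrange(genome_length)
    # Accessible segments are the complement of the gaps (half-open intervals).
    segments, cursor = [], 0
    for start, end in sorted(gaps):
        if start > cursor:
            segments.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        segments.append((cursor, genome_length))
    # Cumulative segment lengths let us map a uniform index to a position.
    cumulative, total = [], 0
    for start, end in segments:
        total += end - start
        cumulative.append(total)

    def draw():
        k = rng.randrange(total)
        i = bisect_right(cumulative, k)
        offset = k - (cumulative[i - 1] if i else 0)
        return segments[i][0] + offset

    return draw

def count_hits(points, intervals):
    """Test statistic: number of points that fall inside any interval."""
    return sum(any(s <= p < e for s, e in intervals) for p in points)

def permutation_test(points_a, intervals_b, genome_length, gaps,
                     avoid_gaps, n_perm=200, seed=0):
    """Monte Carlo test: relocate track A at random, keep track B fixed."""
    rng = random.Random(seed)
    draw = make_sampler(genome_length, gaps, avoid_gaps, rng)
    observed = count_hits(points_a, intervals_b)
    null_stats = [count_hits([draw() for _ in points_a], intervals_b)
                  for _ in range(n_perm)]
    # One-sided Monte Carlo p-value with the usual +1 correction.
    p_value = (1 + sum(s >= observed for s in null_stats)) / (1 + n_perm)
    return observed, null_stats, p_value

# Toy setup (hypothetical numbers): half of the genome is one large gap.
GENOME_LENGTH = 10_000
GAPS = [(0, 5_000)]
TRACK_B = [(5_000, 7_500)]
# Track A is placed in the accessible half, independently of track B.
_rng_a = random.Random(1)
TRACK_A = [_rng_a.randrange(5_000, GENOME_LENGTH) for _ in range(100)]

obs_ignore, null_ignore, p_ignore = permutation_test(
    TRACK_A, TRACK_B, GENOME_LENGTH, GAPS, avoid_gaps=False)
obs_avoid, null_avoid, p_avoid = permutation_test(
    TRACK_A, TRACK_B, GENOME_LENGTH, GAPS, avoid_gaps=True)
print("ignoring gaps: p =", p_ignore)
print("avoiding gaps: p =", p_avoid)
```

In this toy setup the gap is deliberately exaggerated (half the genome) so the effect is easy to see. When the null model ignores gaps, many randomized points land in the gap, where they can never overlap track B. The null hit counts are therefore too low, and the observed count looks more extreme than it should, even though the two tracks were generated independently.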
