Umap and Bismap: quantifying genome and methylome mappability

Abstract

Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called ‘mappability’, which measures the extent to which sequence reads can be uniquely mapped to it. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions are more susceptible to spurious mapping of reads that originate elsewhere in the genome and carry sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10, is available at https://bismap.hoffmanlab.org for use with genome browsers.
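To make the notion of unique mappability concrete, the following toy sketch counts which k-mers of a short sequence occur exactly once, before and after in-silico C-to-T conversion. It is an illustration only and not the Umap or Bismap implementation: the function names `unique_kmer_starts` and `bisulfite_convert` are hypothetical, only the forward strand is considered, and exact string matching stands in for read alignment.

```python
from collections import Counter


def unique_kmer_starts(genome: str, k: int) -> list[int]:
    """Return start positions whose k-mer occurs exactly once in `genome`.

    Simplifying assumption: forward strand only, exact matches only.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [i for i, kmer in enumerate(kmers) if counts[kmer] == 1]


def bisulfite_convert(genome: str) -> str:
    """Model full bisulfite conversion of the forward strand (every C read as T).

    Simplifying assumption: no cytosine is protected by methylation.
    """
    return genome.replace("C", "T")


toy_genome = "ACGTATGTCC"  # made-up sequence for illustration
k = 3

before = unique_kmer_starts(toy_genome, k)
after = unique_kmer_starts(bisulfite_convert(toy_genome), k)

total = len(toy_genome) - k + 1
print(f"unique {k}-mers before conversion: {len(before)}/{total}")
print(f"unique {k}-mers after conversion:  {len(after)}/{total}")
```

In this toy case all eight 3-mers are unique in the original sequence, but only four remain unique after conversion, because collapsing C into T makes previously distinct k-mers identical. This is the effect that motivates computing mappability separately for the bisulfite-converted genome.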
