Estimation and Correction for GC-Content Bias in High-Throughput Sequencing

GC-content bias describes the dependence between fragment count (read coverage) and GC content found in high-throughput sequencing assays, particularly the Illumina Genome Analyzer technology. This bias can dominate the signal of interest for analyses that focus on measuring fragment abundance within a genome, such as copy number estimation. The bias is not consistent between samples, and current methods for removing it from a single sample assume no knowledge of the shape or scale of the GC curve. In this work we analyze regularities in the GC-bias patterns and find a compact description for this curve family. It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich and AT-rich fragments are under-represented in the sequencing results. Based on these findings, we propose a new method to calculate predicted coverage and correct for the bias. This parsimonious model produces single-bp predictions, which suffice to predict the GC effect on fragment coverage at all scales, on all chromosomes, and for both strands; this allows optimal GC-effect correction regardless of the downstream smoothing or binning. We demonstrate our model’s potential for improving on current approaches to copy-number estimation. These GC-modeling considerations can also inform other high-throughput sequencing analyses such as ChIP-seq and RNA-seq. Finally, our analysis provides empirical evidence strengthening the hypothesis that PCR is the most important cause of the GC bias.
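The idea of conditioning fragment counts on the GC content of the full fragment can be illustrated with a minimal sketch. This is not the implementation used in this work: it assumes a single fixed fragment length, treats every position as mappable, and identifies each fragment by its 5' start position. The function names (`gc_counts`, `estimate_gc_rates`, `predict_start_rates`) are hypothetical and introduced only for illustration.

```python
from collections import Counter


def gc_counts(seq, frag_len):
    """GC count of the frag_len-long window starting at each position."""
    is_gc = [1 if base in "GCgc" else 0 for base in seq]
    n_windows = len(seq) - frag_len + 1
    if n_windows <= 0:
        return []
    counts = [sum(is_gc[:frag_len])]
    # Slide the window one base at a time, updating the running count.
    for i in range(1, n_windows):
        counts.append(counts[-1] - is_gc[i - 1] + is_gc[i + frag_len - 1])
    return counts


def estimate_gc_rates(seq, frag_starts, frag_len):
    """Fragments per position, stratified by fragment GC count.

    Assumption: every position is equally mappable, so the denominator
    is simply the number of positions with a given GC count.
    """
    gc = gc_counts(seq, frag_len)
    positions_per_gc = Counter(gc)
    frags_per_gc = Counter(gc[s] for s in frag_starts if 0 <= s < len(gc))
    return {g: frags_per_gc.get(g, 0) / n for g, n in positions_per_gc.items()}


def predict_start_rates(seq, rates, frag_len):
    """Single-bp prediction: expected fragment starts at each position."""
    return [rates.get(g, 0.0) for g in gc_counts(seq, frag_len)]


if __name__ == "__main__":
    seq = "AAAAGGGGCCCCAAAA"   # toy reference
    starts = [2, 2, 10, 3, 4]  # toy fragment start positions
    rates = estimate_gc_rates(seq, starts, frag_len=4)
    print(rates)
    print(predict_start_rates(seq, rates, frag_len=4))
```

Because the prediction is made per base pair, it can be summed over any bin or smoothing window afterwards, and observed counts in that bin can be divided by the summed prediction to obtain a GC-corrected value.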
