Model-based quality assessment and base-calling for second-generation sequencing data.

Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may thus become possible within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T between 30 and 100 characters long), which are the result of complex processing of noisy continuous fluorescence intensity measurements, a process known as base-calling. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single-nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates, readily usable in quality assessment tools, while significantly improving base-calling performance.
