De novo fragment assembly with short mate-paired reads: Does the read length matter?

Increasing read length is currently viewed as the crucial condition for fragment assembly with next-generation sequencing technologies. However, introducing mate-paired reads (separated by a gap of length GapLength) opens the possibility of transforming short mate-pairs into long mate-reads of length approximately GapLength, and thus raises the question of whether the read length (as opposed to GapLength) even matters. We describe a new tool, EULER-USR, for assembling mate-paired short reads and use it to analyze this question. We further complement the ongoing experimental efforts to maximize read length with a new computational approach for increasing the effective read length. While the common practice is to trim the error-prone tails of the reads, we present an approach that replaces trimming with error correction using repeat graphs. An important and counterintuitive implication of this result is that one may extend sequencing reactions that degrade with length "past their prime," to the point where the error rate grows above what is normally acceptable for fragment assembly.
