Error correction of high-throughput sequencing datasets with non-uniform coverage

Motivation: The continuing improvements to high-throughput sequencing (HTS) platforms have begun to unfold a myriad of new applications. As a result, error correction of sequencing reads remains an important problem. Though several tools do an excellent job of correcting datasets where the reads are sampled close to uniformly, the problem of correcting reads coming from drastically non-uniform datasets, such as those from single-cell sequencing, remains open.

Results: In this article, we develop the method Hammer for error correction without any uniformity assumptions. Hammer is based on a combination of a Hamming graph and a simple probabilistic model for sequencing errors. It is a simple and adaptable algorithm that improves on other tools on non-uniform single-cell data, while achieving comparable results on normal multi-cell data.

Availability: http://www.cs.toronto.edu/~pashadag

Contact: pmedvedev@cs.ucsd.edu
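The abstract names a Hamming graph as one of Hammer's two ingredients but does not spell out its construction. The sketch below illustrates the general idea only, under assumptions that are not taken from the abstract: vertices are k-mers, two k-mers are joined when their Hamming distance is at most a threshold `tau`, and each connected component is collapsed to a count-weighted majority consensus. The count-weighted vote stands in for the probabilistic error model, which the abstract does not specify. All function names and the all-pairs comparison are hypothetical and chosen for clarity, not efficiency.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_components(kmer_counts, tau):
    """Group k-mers into connected components of the Hamming graph.

    Assumption: an edge joins two k-mers whose Hamming distance is <= tau.
    Uses a simple union-find over all pairs (quadratic, illustration only).
    """
    nodes = list(kmer_counts)
    parent = {u: u for u in nodes}

    def find(u):
        # Follow parent links to the root, halving the path as we go.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in combinations(nodes, 2):
        if hamming(u, v) <= tau:
            parent[find(u)] = find(v)

    components = {}
    for u in nodes:
        components.setdefault(find(u), []).append(u)
    return list(components.values())


def consensus(component, kmer_counts):
    """Count-weighted majority vote at each position of a component.

    Assumption: this simple vote replaces the paper's probabilistic model.
    """
    k = len(component[0])
    result = []
    for i in range(k):
        votes = Counter()
        for kmer in component:
            votes[kmer[i]] += kmer_counts[kmer]
        result.append(votes.most_common(1)[0][0])
    return "".join(result)


# Toy input: one frequent k-mer, one likely erroneous variant, one unrelated k-mer.
counts = Counter({"ACGTAC": 5, "ACGTTC": 1, "TTTTTT": 3})
components = hamming_components(counts, tau=1)
corrected = sorted(consensus(c, counts) for c in components)
```

Because the grouping relies only on sequence similarity within a component and not on a global coverage threshold, a sketch like this makes no assumption that reads are sampled uniformly, which is the property the abstract emphasizes.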
