Near-optimal assembly for shotgun sequencing with noisy reads

Recent work identified the fundamental limits on the read length and coverage depth required for successful de novo genome reconstruction from shotgun sequencing data, under the idealized assumption of error-free (noiseless) reads. In this work, we show that even when the reads are noisy, successful reconstruction is possible with information requirements close to the noiseless fundamental limit. We design a new assembly algorithm, X-phased Multibridging, based on a probabilistic model of the genome. Analysis shows that it performs well on the model, and simulations show that it performs well on real genomes.
