Information Theory of DNA Shotgun Sequencing

DNA sequencing is the basic workhorse of modern-day biology and medicine. Shotgun sequencing is the dominant technique: many short fragments, called reads, are extracted from random locations in the DNA sequence and then assembled to reconstruct the original sequence. A basic question is: given a sequencing technology and the statistics of the DNA sequence, what is the minimum number of reads required for reliable reconstruction? This number is a fundamental limit on the performance of any assembly algorithm. For a simple statistical model of the DNA sequence and the read process, we show that the answer exhibits a critical phenomenon in the asymptotic limit of long DNA sequences: if the read length is below a threshold, reconstruction is impossible no matter how many reads are observed, and if the read length is above the threshold, having enough reads to cover the DNA sequence is sufficient for reconstruction. The threshold is computed in terms of the Rényi entropy rate of the DNA sequence. We also study the impact of noise in the read process on performance.
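The threshold and the coverage condition described above can be illustrated numerically. The sketch below is not taken from the abstract; it assumes the simplest version of the model, in which bases are drawn i.i.d. from a distribution p, the relevant quantity is the Rényi entropy of order 2, H_2(p) = -ln(sum of p_i^2), and the critical read length for a sequence of length G is 2 ln(G) / H_2(p). The coverage count is taken as the solution of N = (G/L) ln(N/eps), a Lander-Waterman style estimate. The function names and the parameters `genome_length`, `read_length`, and `eps` are hypothetical.

```python
import math


def renyi_entropy_order2(probs):
    """Rényi entropy of order 2, in nats, of a base distribution (assumed i.i.d. model)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -math.log(sum(p * p for p in probs))


def critical_read_length(genome_length, probs):
    """Assumed threshold 2 ln(G) / H_2(p): shorter reads cannot support reconstruction."""
    return 2.0 * math.log(genome_length) / renyi_entropy_order2(probs)


def coverage_read_count(genome_length, read_length, eps, iterations=100):
    """Solve N = (G/L) ln(N/eps) by fixed-point iteration (assumed coverage estimate).

    eps is the tolerated probability of leaving part of the sequence uncovered.
    """
    ratio = genome_length / read_length
    n = ratio  # starting guess: one layer of reads
    for _ in range(iterations):
        n = ratio * math.log(n / eps)
    return n


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    G = 10**6
    print("H_2 (nats):", renyi_entropy_order2(uniform))
    print("critical read length:", critical_read_length(G, uniform))
    print("reads for coverage at L=100:", coverage_read_count(G, 100, 0.01))
```

Under these assumptions, a skewed base distribution lowers H_2(p) and therefore raises the critical read length, which matches the intuition that more repetitive sequences need longer reads.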
