Information Theory of DNA Sequencing

DNA sequencing is the basic workhorse of modern-day biology and medicine. Shotgun sequencing is the dominant technique: many randomly located short fragments, called reads, are extracted from the DNA sequence and then assembled to reconstruct the original sequence. A basic question is: given a sequencing technology and the statistics of the DNA sequence, what is the minimum number of reads required for reliable reconstruction? This number is a fundamental limit on the performance of any assembly algorithm. By drawing an analogy between the DNA sequencing problem and the classic communication problem, we formulate this question in terms of an information-theoretic notion of sequencing capacity: the asymptotic ratio of the length of the DNA sequence to the minimum number of reads required to reconstruct it reliably. We compute the sequencing capacity explicitly for a simple statistical model of the DNA sequence and the read process. Using this framework, we also study the impact of noise in the read process on the sequencing capacity.
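To make the read process concrete, the following is a minimal Python sketch of shotgun sampling. It is an illustration only, not the model or method analyzed in the paper. The names `sample_reads` and `covers` are hypothetical, as are the assumptions it makes: a uniformly random sequence over the four bases, noiseless reads, uniformly random start positions, and a circular sequence so that reads near the end wrap around. It reports the ratio of sequence length to number of reads and checks whether the reads cover every position. Coverage is a necessary condition for reconstruction but does not by itself guarantee it.

```python
import random


def sample_reads(sequence, num_reads, read_len, rng):
    """Draw noiseless reads at uniformly random start positions.

    Assumption (not from the source): the sequence is treated as circular,
    so a read starting near the end wraps around to the beginning.
    Returns a list of (start, read) pairs.
    """
    g = len(sequence)
    if not 0 < read_len <= g:
        raise ValueError("read_len must be between 1 and len(sequence)")
    # Append a prefix so that wrapped reads can be taken with a plain slice.
    extended = sequence + sequence[:read_len]
    starts = [rng.randrange(g) for _ in range(num_reads)]
    return [(s, extended[s:s + read_len]) for s in starts]


def covers(starts, read_len, seq_len):
    """Return True if reads at the given starts touch every position."""
    covered = [False] * seq_len
    for s in starts:
        for k in range(read_len):
            covered[(s + k) % seq_len] = True
    return all(covered)


# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
sequence = "".join(rng.choice("ACGT") for _ in range(1000))
reads = sample_reads(sequence, num_reads=200, read_len=50, rng=rng)

ratio = len(sequence) / len(reads)  # sequence length per read
fully_covered = covers([s for s, _ in reads], 50, len(sequence))
print(f"length/reads = {ratio}, fully covered: {fully_covered}")
```

Rerunning the sketch with fewer reads shows coverage failing more often, which gives an intuition for why a minimum number of reads exists at all.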
