Poisson Process Approximation for Sequence Repeats and Sequencing by Hybridization

Sequencing by hybridization is a tool to determine a DNA sequence from the unordered list of all l-tuples contained in this sequence; typical values of l are 8, 10, and 12. For theoretical purposes we assume that the multiset of all l-tuples is known. This multiset determines the DNA sequence uniquely if none of the so-called Ukkonen transformations are possible. These transformations require repeats of (l-1)-tuples in the sequence, with these repeats occurring in certain spatial patterns. We model DNA as an i.i.d. sequence. In a first step, we prove Poisson process approximations for the process of indicators of all leftmost long repeats allowing self-overlap and for the process of indicators of all leftmost long repeats without self-overlap. Using the Chen-Stein method, we get bounds on the error of these approximations. As a corollary, we approximate the distribution of longest repeats. In a second step, we analyze the spatial patterns of the repeats. Finally, we combine these two steps to prove an approximation for the probability that a random sequence is uniquely recoverable from its list of l-tuples. We give numerical examples, including error bounds, for all our results.
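
The objects named in the abstract can be illustrated with a small Python sketch. It is not the method of the paper: it only builds the multiset of l-tuples of a sequence, lists the repeated (l-1)-tuples, and estimates by simulation how often an i.i.d. sequence has no repeated (l-1)-tuple at all. Since the Ukkonen transformations require such repeats, a repeat-free sequence is uniquely recoverable, so the estimate is only a lower bound on the recovery probability. The function names, the uniform letter distribution over A, C, G, T, and the parameters `trials` and `seed` are assumptions made for this illustration.

```python
import random
from collections import Counter


def l_tuple_multiset(seq, l):
    """Return the multiset (as a Counter) of all length-l substrings of seq."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def repeated_tuples(seq, k):
    """Return the set of length-k substrings occurring at least twice.

    Occurrences may overlap, so self-overlapping repeats are counted too.
    """
    counts = l_tuple_multiset(seq, k)
    return {word for word, count in counts.items() if count > 1}


def estimate_repeat_free_probability(n, l, trials=2000, alphabet="ACGT", seed=0):
    """Monte Carlo estimate of P(no repeated (l-1)-tuple) for an i.i.d. sequence.

    Assumption: letters are drawn uniformly from `alphabet`. Being repeat-free
    is sufficient (not necessary) for unique recovery from the l-tuples.
    """
    rng = random.Random(seed)
    repeat_free = 0
    for _ in range(trials):
        seq = "".join(rng.choice(alphabet) for _ in range(n))
        if not repeated_tuples(seq, l - 1):
            repeat_free += 1
    return repeat_free / trials


if __name__ == "__main__":
    example = "ACGTACGA"
    print(l_tuple_multiset(example, 4))   # the l-tuples for l = 4
    print(repeated_tuples(example, 3))    # repeated (l-1)-tuples
    print(estimate_repeat_free_probability(n=200, l=8))
```

For realistic sequence lengths a simulation of this kind becomes slow and gives no error bound, which is where an analytic approximation such as the one described in the abstract is useful.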
