Euler circuits and DNA sequencing by hybridization

Sequencing by hybridization is a method of reconstructing a long DNA string (that is, determining its nucleotide sequence) from knowledge of its short substrings. Unique reconstruction is not always possible, and the goal of this paper is to study the number of reconstructions of a random string. For a given string, the number of reconstructions is determined by the pattern of repeated substrings; in an appropriate limit substrings will occur at most twice, so the pattern of repeats is given by a pairing: a string of length $2n$ in which each symbol occurs twice. A pairing induces a 2-in, 2-out graph, whose directed edges are defined by successive symbols of the pairing (for example, the pairing ABBCAC induces the graph with edges AB, BB, BC, and so forth), and the number of reconstructions is simply the number of Euler circuits in this 2-in, 2-out graph. The original problem is thus transformed into one about pairings: to find the number $f_k(n)$ of $n$-symbol pairings having $k$ Euler circuits. We show how to compute this function, in closed form, for any fixed $k$, and we present the functions explicitly for $k = 1, \dots, 9$. The key is a decomposition theorem: the Euler “circuit number” of a pairing is the product of the circuit numbers of “component” sub-pairings. These components come from connected components of the “interlace graph”, which has the pairing's symbols as vertices, and edges when symbols are “interlaced”. (A and B are interlaced if the pairing has the form ABAB or BABA.) We carry these results back to the original question about DNA strings, and provide a total variation distance upper bound for the approximation error. We perform an asymptotic enumeration of 2-in, 2-out digraphs to show that, for a typical random $n$-pairing, the number of Euler circuits is of order no smaller than $2^n/n$, and the expected number is asymptotically at least $e^{-1/2}\,2^{n-1}/n$. Since any $n$-pairing has at most $2^{n-1}$ Euler circuits, this pinpoints the exponential growth rate.
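
As an illustration of the objects named in the abstract, the following is a minimal Python sketch, not the paper's own method. It assumes that the pairing is read cyclically (the last symbol is followed by the first, so that every vertex has in-degree and out-degree 2), that parallel edges are treated as distinct, and that a "component sub-pairing" is the pairing restricted to the symbols of one interlace-graph component. The circuit count uses the BEST theorem, which for a 2-in, 2-out graph reduces to a cofactor of the graph Laplacian. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def induced_edges(pairing):
    """Directed edges between successive symbols (cyclic wrap-around is an assumption)."""
    n = len(pairing)
    return [(pairing[i], pairing[(i + 1) % n]) for i in range(n)]


def circuit_number(pairing):
    """Number of Euler circuits of the induced 2-in, 2-out graph.

    By the BEST theorem the count is (number of spanning arborescences)
    times the product of (outdeg - 1)!, and the latter is 1 here.
    """
    symbols = sorted(set(pairing))
    index = {s: i for i, s in enumerate(symbols)}
    m = len(symbols)
    lap = [[Fraction(0)] * m for _ in range(m)]
    for u, v in induced_edges(pairing):
        lap[index[u]][index[u]] += 1  # out-degree on the diagonal
        lap[index[u]][index[v]] -= 1  # loops cancel out automatically
    # Delete the first row and column, then take the determinant exactly.
    minor = [row[1:] for row in lap[1:]]
    size = m - 1
    det = Fraction(1)
    for c in range(size):
        pivot = next((r for r in range(c, size) if minor[r][c] != 0), None)
        if pivot is None:
            return 0
        if pivot != c:
            minor[c], minor[pivot] = minor[pivot], minor[c]
            det = -det
        det *= minor[c][c]
        for r in range(c + 1, size):
            factor = minor[r][c] / minor[c][c]
            for k in range(c, size):
                minor[r][k] -= factor * minor[c][k]
    return int(det)


def interlace_components(pairing):
    """Connected components of the interlace graph, as sets of symbols."""
    pos = {}
    for i, s in enumerate(pairing):
        pos.setdefault(s, []).append(i)
    adj = {s: set() for s in pos}
    for a, b in combinations(pos, 2):
        a1, a2 = pos[a]
        # a and b interlace iff exactly one occurrence of b lies between the a's
        if sum(a1 < p < a2 for p in pos[b]) == 1:
            adj[a].add(b)
            adj[b].add(a)
    seen, components = set(), []
    for s in pos:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        components.append(comp)
    return components


def restrict(pairing, symbols):
    """Sub-pairing obtained by keeping only the given symbols."""
    return "".join(s for s in pairing if s in symbols)


if __name__ == "__main__":
    p = "ABBCAC"
    product = 1
    for comp in interlace_components(p):
        product *= circuit_number(restrict(p, comp))
    print(circuit_number(p), product)
```

For ABBCAC, only A and C are interlaced, so under these assumptions the components are {A, C} and {B}, and the product of the component circuit numbers can be compared directly with the circuit number of the whole pairing.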
