Shuffling Biological Sequences

Abstract: This paper considers the following sequence shuffling problem: given a biological sequence s (either DNA or protein), generate a random instance among all the permutations of s that exhibit the same frequencies of k-lets (e.g., dinucleotides, doublets of amino acids, triplets, etc.). Since certain biases in the usage of k-lets are fundamental to biological sequences, effective generation of such sequences is essential for evaluating the results of many sequence analysis tools. This paper introduces two sequence shuffling algorithms. The first, a simple swapping-based algorithm, is shown to generate a near-random instance and appears to work well in practice, although its efficiency is unproven. The second, a generation algorithm based on Euler tours, is proven to produce a precisely uniform instance, and hence to solve the sequence shuffling problem, in time not much more than linear in the sequence length.
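To make the k-let constraint concrete, the following is a minimal Python sketch of a swapping-style shuffle for the doublet case (k = 2). It is an illustration under stated assumptions, not the paper's exact procedure: the abstract does not spell out the swap rule, so the rule used here (pick four positions a < b < c < d with s[a] = s[c] and s[b] = s[d], then exchange the two substrings lying strictly between a and b and between c and d) is one plausible doublet-preserving move. The names `klet_counts`, `swap_step`, and `swap_shuffle`, and the parameters `steps` and `seed`, are hypothetical. The sketch makes no claim about how close the output is to uniform.

```python
import random
from collections import Counter


def klet_counts(s, k):
    """Count every length-k substring (k-let) of s, overlaps included."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def swap_step(s, rng):
    """Attempt one doublet-preserving swap (assumed rule, k = 2 only).

    View s as  P x A y Q x B y R  with x = s[a] = s[c] and y = s[b] = s[d].
    Exchanging A and B keeps every doublet count unchanged, because each
    block is still preceded by x and followed by y. If the randomly drawn
    positions do not match, s is returned unchanged.
    """
    n = len(s)
    if n < 4:
        return s
    a, b, c, d = sorted(rng.sample(range(n), 4))
    if s[a] != s[c] or s[b] != s[d]:
        return s
    # P x  +  B  +  y Q x  +  A  +  y R
    return s[:a + 1] + s[c + 1:d] + s[b:c + 1] + s[a + 1:b] + s[d:]


def swap_shuffle(s, steps=1000, seed=None):
    """Apply `steps` swap attempts; many attempts are rejected no-ops."""
    rng = random.Random(seed)
    for _ in range(steps):
        s = swap_step(s, rng)
    return s


if __name__ == "__main__":
    original = "ACGTTGCAACGTAGCTAGCTAGGATCCA"
    shuffled = swap_shuffle(original, steps=2000, seed=1)
    print(shuffled)
    # The doublet (dinucleotide) counts are identical by construction.
    print(klet_counts(shuffled, 2) == klet_counts(original, 2))
```

Because each accepted move only exchanges two blocks that share the same flanking symbols, the first and last characters of the sequence never change, and the single-letter composition is preserved along with the doublet counts. Extending the idea to larger k would require matching (k-1)-lets instead of single symbols and handling overlaps, which this sketch does not attempt.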
