uShuffle: A useful tool for shuffling biological sequences while preserving the k-let counts

Background: Randomly shuffled sequences are routinely used in sequence analysis to evaluate the statistical significance of a biological sequence. In many cases, biologists need sophisticated shuffling tools that preserve not only the counts of distinct letters but also higher-order statistics such as doublet counts, triplet counts, and, in general, k-let counts.

Results: We present a sequence analysis tool, uShuffle, for generating uniform random permutations of biological sequences (such as DNA, RNA, and protein sequences) that preserve the exact k-let counts. The uShuffle tool implements the latest variant of the Euler algorithm and uses Wilson's algorithm in the crucial step of arborescence generation. It is carefully engineered for efficiency, and it accepts an arbitrary alphabet size and an arbitrary let size k. It can be used as a command-line program, a web application, or a utility library. Source code in C, Java, and C#, and integration instructions for Perl and Python, are provided.

Conclusion: The uShuffle tool surpasses existing implementations of the Euler algorithm in both performance and flexibility. It is a useful tool for the bioinformatics community.
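To make the idea of a k-let-preserving shuffle concrete, the following is a minimal Python sketch of the general Euler-algorithm approach named in the abstract: treat each (k-1)-let as a vertex and each k-let as a directed edge, sample a random arborescence rooted at the final (k-1)-let with Wilson's loop-erased random walk, and then walk the edges in a randomized order that takes each vertex's arborescence edge last. This is an illustration under stated assumptions, not the uShuffle source code; the function names `klet_counts` and `shuffle_preserving_klets` are hypothetical and do not correspond to the tool's actual interface.

```python
import random
from collections import Counter


def klet_counts(seq, k):
    """Count every overlapping k-let (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffle_preserving_klets(seq, k, rng=None):
    """Return a random permutation of seq with the same k-let counts.

    Illustrative sketch only (hypothetical helper, not the uShuffle API).
    """
    rng = rng or random.Random()
    if k <= 1:
        # Only letter counts need to be preserved: a plain shuffle suffices.
        chars = list(seq)
        rng.shuffle(chars)
        return "".join(chars)
    if len(seq) <= k:
        # At most one k-let: the sequence itself is the only arrangement.
        return seq

    # Vertices are (k-1)-lets; each k-let is an edge between consecutive ones.
    nodes = [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]
    out_edges = {}
    for a, b in zip(nodes, nodes[1:]):
        out_edges.setdefault(a, []).append(b)
    root = nodes[-1]

    # Wilson's algorithm: loop-erased random walks toward the root.
    in_tree = {root}
    next_hop = {}
    for start in list(out_edges):
        u = start
        while u not in in_tree:
            # Overwriting next_hop[u] on a revisit erases the loop.
            next_hop[u] = rng.choice(out_edges[u])
            u = next_hop[u]
        u = start
        while u not in in_tree:
            in_tree.add(u)
            u = next_hop[u]

    # Order each vertex's outgoing edges: shuffle all but one copy of the
    # arborescence edge, which must be the last exit from that vertex.
    order = {}
    for u, dests in out_edges.items():
        dests = list(dests)
        if u != root:
            dests.remove(next_hop[u])
            rng.shuffle(dests)
            dests.append(next_hop[u])
        else:
            rng.shuffle(dests)
        order[u] = dests

    # Follow the ordered edges from the first (k-1)-let until none remain.
    used = {u: 0 for u in order}
    cur = nodes[0]
    result = [cur]
    while cur in order and used[cur] < len(order[cur]):
        nxt = order[cur][used[cur]]
        used[cur] += 1
        result.append(nxt[-1])
        cur = nxt
    return "".join(result)
```

A call such as `shuffle_preserving_klets("ACGTACGGTCATTGCAACGT", 2, random.Random(0))` returns a sequence of the same length whose doublet counts, as reported by `klet_counts`, match those of the input. The sketch favors readability over speed and uses Python strings as vertices; an engineered implementation would likely index vertices by integers instead.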
