Multiple alignment-free sequence comparison

MOTIVATION: Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, C(*)1 and C(S)1, extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, C(*)2, C(S)2 and C(geo)2, averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences.

RESULTS: Our investigation uses both simulated data and cis-regulatory module data, where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that, although all of our statistics perform similarly on real data, on simulated data the Shepp-type statistics are in some instances outperformed by the star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics.

AVAILABILITY: Our implementation of the five statistics is available as an R package named 'multiAlignFree' at http://www-rcf.usc.edu/~fsun/Programs/multiAlignFree/multiAlignFreemain.html.

CONTACT: reinert@stats.ox.ac.uk

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
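To make the idea of averaging pairwise k-tuple statistics concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of the 'multiAlignFree' package: it uses the plain, uncentred D2 count (the sum over words of the product of the two word counts) as a stand-in for the star-type and Shepp-type statistics, whose centring and normalisation the abstract does not spell out. The function names `ktuple_counts`, `d2` and `average_pairwise` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ktuple_counts(seq, k):
    """Count every overlapping k-tuple (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Uncentred D2: sum over words of the product of the two word counts.

    Assumption: a simple stand-in, not the centred or normalised
    statistics studied in the paper.
    """
    counts_a = ktuple_counts(seq_a, k)
    counts_b = ktuple_counts(seq_b, k)
    # A Counter returns 0 for words that are absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def average_pairwise(seqs, k, stat=d2):
    """Average a pairwise statistic over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(stat(a, b, k) for a, b in pairs) / len(pairs)
```

Passing the pairwise statistic as the `stat` argument keeps the averaging step separate from the choice of statistic, so a centred variant could be swapped in without changing `average_pairwise`.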
