An improved string composition method for sequence comparison

Background: Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison, one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, has both fundamental and computational limitations. Consequently, alignment-free methods have been explored as important alternatives for estimating sequence similarity. Among the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer from certain statistical problems and thereby underestimate the amount of evolutionary information in genetic sequences.

Results: We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method, under the assumption of a uniform and independent model, to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust than its existing counterparts and comparable in robustness with alignment-based methods.

Conclusion: We observed two problems in the currently used string composition methods and proposed a new robust method for estimating the evolutionary information of genetic sequences. In addition, we argue that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), because vector strings of variable length overlap one another. We suggest a practical approach for choosing an optimal string length to construct the CCV.
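To make the idea of a composition vector concrete, the following is a minimal Python sketch. It is not the authors' equation: the abstract does not state the normalization formula, so the relative-deviation form `(observed - expected) / expected`, the nucleotide alphabet, the cosine-based distance, and all function names are assumptions made for illustration. The expected frequency is taken from a uniform and independent model, in which every string of length k over a four-letter alphabet has probability 1/4^k.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: nucleotide sequences


def composition_vector(seq, k):
    """Relative deviation of each k-string frequency from a uniform,
    independent expectation (illustrative normalization, see lead-in)."""
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    # Count overlapping k-strings; strings containing letters outside
    # ALPHABET are counted in n_windows but have no vector entry.
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    expected = 1.0 / len(ALPHABET) ** k
    vector = []
    for letters in product(ALPHABET, repeat=k):
        observed = counts["".join(letters)] / n_windows
        vector.append((observed - expected) / expected)
    return vector


def complete_composition_vector(seq, max_k):
    """Concatenate the composition vectors for string lengths 1..max_k."""
    vector = []
    for k in range(1, max_k + 1):
        vector.extend(composition_vector(seq, k))
    return vector


def cosine_distance(u, v):
    """Distance in [0, 1] derived from the cosine of the angle between
    two vectors (a common choice, assumed here)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("distance is undefined for a zero vector")
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


if __name__ == "__main__":
    a = complete_composition_vector("AACGTTGCAAGGCT", 2)
    b = complete_composition_vector("AACGTAGCATGGCT", 2)
    print(len(a), cosine_distance(a, b))
```

With `max_k = 2` the vector has 4 + 16 = 20 entries. Because the number of entries grows as 4^k, the choice of maximum string length directly controls the size of the vector, which is why a practical rule for choosing that length matters.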
