Phylogenetic tree construction using trinucleotide usage profile (TUP)

Background

Building a genome-wide phylogenetic tree for a large group of species is challenging when the genomes contain many genes with long nucleotide sequences. A popular method, the feature frequency profile (FFP-k), computes the frequency distribution of all words of a given length k over the whole genome sequence, using overlapping windows of that length. For a satisfactory result, the recommended word length k ranges from 6 to 15, and it need not be a multiple of 3 (the codon length). The total number of possible words needed for FFP-k can therefore range from 4^6 = 4096 to 4^15.

Results

We propose a simple improvement over the FFP method that uses only a word length of 3. The new method, called Trinucleotide Usage Profile (TUP), is based only on the relative frequency distribution of words in non-overlapping windows of length 3. The total number of possible words needed for TUP is 4^3 = 64, far fewer than the number required at the recommended optimal "resolution" for FFP. To build a phylogenetic tree, we propose first representing each species by a TUP vector and then applying an appropriate distance measure to pairs of TUP vectors for the tree construction. In particular, we propose summarizing a DNA sequence by a matrix with three rows, one per reading frame, where each row records the frequency distribution of the non-overlapping words of length 3 in that reading frame. We also provide a numerical measure for comparing trees constructed with different methods.

Conclusions

In our empirical study, the proposed TUP method built phylogenetic trees with stronger biological support than the FFP method. We further justify this result from an information-theoretic viewpoint. Unlike the FFP method, the TUP method takes advantage of the fact that the start of the first reading frame is (usually) known. Without this information, the FFP method can rely only on the frequency distribution of overlapping words, which is the average (or mixture) of the frequency distributions of the three possible reading frames. Consequently, we show from the entropy viewpoint that the FFP procedure can dilute important gene information and therefore provide less accurate classification.
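The contrast between the two profiles can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`frame_counts`, `tup_matrix`, `overlapping_profile`, `tup_distance`) are hypothetical, the Euclidean distance stands in for the unspecified "appropriate distance measure", and the toy sequences are invented. The sketch builds the 3 x 64 TUP matrix from non-overlapping windows, then forms the overlapping-window profile for k = 3 by pooling the three frames. Because every window start belongs to exactly one frame, the pooled profile is a mixture of the three frame profiles, and since entropy is concave, its entropy is at least the weighted average of the frame entropies.

```python
from collections import Counter
from itertools import product
import math

# The 4^3 = 64 possible trinucleotides, in a fixed order.
WORDS = ["".join(p) for p in product("ACGT", repeat=3)]


def frame_counts(seq, frame):
    """Count non-overlapping 3-letter words starting at offset `frame` (0, 1 or 2).

    Words containing letters other than A, C, G, T are ignored.
    """
    counts = Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return [counts.get(w, 0) for w in WORDS]


def normalize(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def tup_matrix(seq):
    """3 x 64 matrix of relative frequencies, one row per reading frame."""
    return [normalize(frame_counts(seq, f)) for f in range(3)]


def overlapping_profile(seq):
    """Profile over all overlapping windows of length 3.

    Pooling the three frames' counts is the same as sliding the window
    one base at a time, so this is a mixture of the three frame profiles.
    """
    per_frame = [frame_counts(seq, f) for f in range(3)]
    return normalize([sum(col) for col in zip(*per_frame)])


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def tup_distance(a, b):
    """Euclidean distance between two TUP matrices (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


if __name__ == "__main__":
    seq_a = "ATGGCCATGGCC"  # invented toy sequences
    seq_b = "ATGGCAATGGCA"
    tup_a, tup_b = tup_matrix(seq_a), tup_matrix(seq_b)
    print("frame-0 entropy:", entropy(tup_a[0]))
    print("pooled entropy: ", entropy(overlapping_profile(seq_a)))
    print("TUP distance:   ", tup_distance(tup_a, tup_b))
```

A matrix of pairwise distances computed this way could then be passed to any standard distance-based tree-building method; which method suits a given data set is left open here.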
