Detecting Phylogenetic Signals in Eukaryotic Whole Genome Sequences

Whole genome sequences are a rich source of molecular data, with the potential to reveal novel evolutionary information. Yet many parts of these sequences are not known to be under evolutionary pressure and are therefore not conserved. Furthermore, no good model of whole genome evolution exists. Consequently, it is not clear a priori whether a meaningful phylogenetic signal exists and can be extracted from the sequences as a whole. Indeed, very few phylogenies have been reconstructed from these sequences. Prior to this work, only two reconstruction methods had been applied to large eukaryotic genomes: the K(r) method (Haubold et al., 2009), applied to genomes of rather low diversity (Drosophila species), and the feature frequency profile (FFP) method (Sims et al., 2009a), applied to genomes of moderate diversity (mammals). We investigate whole genome-based phylogenetic reconstruction on a much wider taxonomic sample. We apply K(r), FFP, and an alternative alignment-free method, the average common substring (ACS) (Ulitsky et al., 2006), to 24 multicellular eukaryotes (vertebrates, invertebrates, and plants). We also apply ACS to the proteome sequences of these 24 taxa. We compare the resulting trees to a standard reference, the National Center for Biotechnology Information (NCBI) taxonomy tree. Trees produced by ACS(AA), based on proteomes, are in complete agreement with the NCBI tree. For the genome-based reconstruction, ACS(DNA) produces trees whose agreement with the NCBI tree is excellent to very good for divergence times up to 800 million years ago, medium at 1 billion years ago, and poor at 1.6 billion years ago. We conclude that whole genomes do carry a clear phylogenetic signal, yet this signal "saturates" at longer divergence times. Furthermore, of the few existing methods, ACS is the most capable of detecting this signal.
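
To make the ACS idea concrete, the following is a minimal, naive sketch of an ACS-style distance between two sequences. It is not the implementation used in this work. The function names (`avg_common_substring`, `acs_distance`) are hypothetical, and the normalisation and correction term follow our reading of the definition in Ulitsky et al. (2006), so they should be treated as assumptions. The quadratic-time search is for illustration only; genome-scale inputs would need suffix-based data structures.

```python
import math


def avg_common_substring(a: str, b: str) -> float:
    """Average, over every start position i in a, of the length of the
    longest substring of a beginning at i that occurs anywhere in b.
    Assumes a is non-empty."""
    total = 0
    for i in range(len(a)):
        length = 0
        # Extend the match while the longer substring still occurs in b.
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        total += length
    return total / len(a)


def acs_distance(a: str, b: str) -> float:
    """Symmetrised ACS-style distance (assumed form, see lead-in)."""

    def directed(x: str, y: str) -> float:
        avg = avg_common_substring(x, y)
        if avg == 0:
            # No shared characters at all: treat as infinitely distant.
            return math.inf
        # log|y| / L(x, y), minus an assumed correction term that
        # approximates the self-comparison value log|x| / L(x, x).
        return math.log(len(y)) / avg - 2 * math.log(len(x)) / len(x)

    return (directed(a, b) + directed(b, a)) / 2
```

A distance matrix built by calling `acs_distance` on every pair of taxa could then be passed to a distance-based tree builder such as neighbor-joining. Because the correction term is only an approximation, very short identical sequences can yield slightly negative values.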

[1]  Se-Ran Jun, et al. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions, 2009, Proceedings of the National Academy of Sciences.

[2]  Jan-Ming Ho, et al. The UniMarker (UM) method for synteny mapping of large genomes, 2004, Bioinform.

[3]  Tatiana A. Tatusova, et al. NCBI Reference Sequences: current status, policy and new initiatives, 2008, Nucleic Acids Res.

[4]  M. Nei, et al. The neighbor-joining method, 1987.

[5]  M. O. Dayhoff, et al. Atlas of protein sequence and structure, 1965.

[6]  Inna Dubchak, et al. The genome portal of the Department of Energy Joint Genome Institute: 2014 updates, 2013, Nucleic Acids Res.

[7]  Peer Bork, et al. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation, 2007, Bioinform.

[8]  Mark W. Westneat, et al. Vertebrates: Comparative Anatomy, Function, Evolution. Kenneth V. Kardong. 1998. Second Edition. McGraw-Hill, Boston, Massachusetts, 1998.

[9]  The Invertebrates, 1959, Nature.

[10]  Gregory D. Schuler, et al. Database resources of the National Center for Biotechnology, 2003, Nucleic Acids Res.

[11]  Mary Goldman, et al. The UCSC Genome Browser database: update 2011, 2010, Nucleic Acids Res.

[12]  J. Leader, et al. A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes, 2002, Molecular biology and evolution.

[13]  Gregory D. Schuler, et al. Database resources of the National Center for Biotechnology Information: update, 2004, Nucleic acids research.

[14]  David L. Wheeler, et al. GenBank, 2015, Nucleic Acids Res.

[15]  Tatiana Tatusova, et al. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, 2004, Nucleic Acids Res.

[16]  Bruno Nyffeler, et al. Early History of Mammals Is Elucidated with the ENCODE Multiple Species Sequencing Data, 2007, PLoS genetics.

[17]  I-Min A. Chen, et al. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata, 2007, Nucleic Acids Res.

[18]  Joel Dudley, et al. TimeTree: a public knowledge-base of divergence times among organisms, 2006, Bioinform.

[19]  Eric D. Green, et al. Confirming the Phylogeny of Mammals by Use of Large Comparative Sequence Data Sets, 2008, Molecular biology and evolution.

[20]  Alberto Apostolico, et al. Fast algorithms for computing sequence distances by exhaustive substring composition, 2008, Algorithms for Molecular Biology.

[21]  S. O’Brien, et al. Molecular phylogenetics and the origins of placental mammals, 2001, Nature.

[22]  J. Qi, et al. Whole Proteome Prokaryote Phylogeny Without Sequence Alignment: A K-String Composition Approach, 2003, Journal of Molecular Evolution.

[23]  Olivier Gascuel, et al. Concerning the NJ algorithm and its unweighted version, UNJ, 1996, Mathematical Hierarchies and Biology.

[24]  M. P. Cummings. PHYLIP (Phylogeny Inference Package), 2004.

[25]  Walter M. Fitch, et al. On the Problem of Discovering the Most Parsimonious Tree, 1977, The American Naturalist.

[26]  Steve Baker, et al. Integrated gene and species phylogenies from unaligned whole genome protein sequences, 2002, Bioinform.

[27]  Thérèse A. Holton, et al. Deep Genomic-Scale Analyses of the Metazoa Reject Coelomata: Evidence from Single- and Multigene Families Analyzed Under a Supertree and Supermatrix Paradigm, 2010, Genome biology and evolution.

[28]  J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach, 2005, Journal of Molecular Evolution.

[29]  Mark A. Ragan, et al. Pattern-Based Phylogenetic Distance Estimation and Tree Reconstruction, 2006.

[30]  N. Saitou, et al. Relative Efficiencies of the Fitch-Margoliash, Maximum-Parsimony, Maximum-Likelihood, Minimum-Evolution, and Neighbor-joining Methods of Phylogenetic Tree Construction in Obtaining the Correct Tree, 1989.

[31]  David Burstein, et al. The Average Common Substring Approach to Phylogenomic Reconstruction, 2006, J. Comput. Biol.

[32]  Thomas Wiehe, et al. Estimating Mutation Distances from Unaligned Genomes, 2009, J. Comput. Biol.

[33]  Andrew M. Jenkinson, et al. Ensembl 2009, 2008, Nucleic Acids Res.

[34]  Ting Wang, et al. The UCSC Genome Browser Database: update 2009, 2008, Nucleic Acids Res.

[35]  Khalid Sayood, et al. A new sequence distance measure for phylogenetic tree construction, 2003, Bioinform.

[36]  K. Kardong, et al. Vertebrates: Comparative Anatomy, Function, Evolution, 1994.

[37]  Alberto Apostolico, et al. Efficient tools for comparative substring analysis, 2010, Journal of biotechnology.

[38]  Se-Ran Jun, et al. Whole-genome phylogeny of mammals: Evolutionary information in genic and nongenic regions, 2009, Proceedings of the National Academy of Sciences.

[39]  D. Robinson, et al. Comparison of phylogenetic trees, 1981.

[40]  David Q. Matus, et al. Broad phylogenomic sampling improves resolution of the animal tree of life, 2008, Nature.

[41]  Alain Guénoche, et al. Comparison of alignment free string distances for complete genome phylogeny, 2009, Adv. Data Anal. Classif.