OrthoSelect: a protocol for selecting orthologous groups in phylogenomics

Background

Phylogenetic studies using expressed sequence tags (ESTs) are becoming a standard approach to answering evolutionary questions. Such studies are usually based on large sets of newly generated, unannotated, and error-prone EST sequences from different species. A crucial first step in EST-based phylogeny reconstruction is to identify groups of orthologous sequences. From these groups, appropriate target genes are selected, and redundant sequences are eliminated to obtain suitable sequence sets as input data for tree-reconstruction software. Generating such data sets manually can be very time consuming. Thus, software tools are needed that carry out these steps automatically.

Results

We developed a flexible and user-friendly software pipeline, running on desktop machines or computer clusters, that constructs data sets for phylogenomic analyses. It automatically searches assembled EST sequences against databases of orthologous groups (OGs), assigns ESTs to these predefined OGs, translates the sequences into proteins, eliminates redundant sequences assigned to the same OG, and creates multiple sequence alignments of the identified orthologous sequences. In an optional last step, it can further process each alignment by excluding potentially homoplastic sites and selecting sufficiently conserved parts. Our software pipeline can be used as it is, but it can also be adapted by integrating additional external programs. This makes the pipeline useful to non-bioinformaticians as well as to bioinformatics experts. The software pipeline is especially designed for ESTs, but it can also handle protein sequences.

Conclusion

OrthoSelect is a tool that produces orthologous gene alignments from assembled ESTs. Our tests show that OrthoSelect detects orthologs in EST libraries with high accuracy. In the absence of a gold standard for orthology prediction, we compared predictions by OrthoSelect to a manually created and published phylogenomic data set. Our tool was not only able to rebuild the data set with a specificity of 98%, but it also detected four percent more orthologous sequences. Furthermore, the results OrthoSelect produces are in absolute agreement with the results of other programs, but our tool offers a significant speedup and additional functionality, e.g. handling of ESTs, computing sequence alignments, and refining them. To our knowledge, there is currently no fully automated and freely available tool for this purpose. Thus, OrthoSelect is a valuable tool for researchers in the field of phylogenomics who deal with large quantities of EST sequences. OrthoSelect is written in Perl and runs on Linux/Mac OS X. The tool can be downloaded at http://gobics.de/fabian/orthoselect.php
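The redundancy-elimination step named in the Results (several sequences from one species assigned to the same OG) can be illustrated with a small sketch. This is not OrthoSelect's own code, which is written in Perl, and the abstract does not state which selection criterion the tool applies. The sketch assumes, purely for illustration, that the hit with the highest search score is kept for each OG and taxon; the `Hit` record, the `select_representatives` function, and all identifiers, taxa, and scores are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One EST assigned to an orthologous group (hypothetical record layout)."""
    seq_id: str
    taxon: str
    og: str
    score: float  # assumed similarity score from the database search


def select_representatives(hits):
    """Keep one sequence per (OG, taxon) pair.

    Assumption: the highest-scoring hit is the representative. The
    criterion actually used by OrthoSelect is not given in the abstract.
    """
    best = {}
    for hit in hits:
        key = (hit.og, hit.taxon)
        if key not in best or hit.score > best[key].score:
            best[key] = hit

    # Group the surviving sequence IDs by OG, in a stable sorted order.
    groups = defaultdict(list)
    for (og, _taxon), hit in sorted(best.items()):
        groups[og].append(hit.seq_id)
    return dict(groups)


# Made-up example data: two redundant Hydra ESTs hit the same OG.
hits = [
    Hit("est_001", "Hydra", "KOG0001", 210.0),
    Hit("est_002", "Hydra", "KOG0001", 185.5),
    Hit("est_101", "Nematostella", "KOG0001", 240.0),
    Hit("est_003", "Hydra", "KOG0002", 98.0),
]
print(select_representatives(hits))
```

Each resulting group holds at most one sequence per taxon, which is the shape of input that a multiple sequence alignment step would then consume.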
