Agalma: an automated phylogenomics workflow

Background

In the past decade, transcriptome data have become an important component of many phylogenetic studies. They are a cost-effective source of protein-coding gene sequences, and have helped projects grow from a few genes to hundreds or thousands of genes. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often cannot be easily applied and evaluated in the context of other studies.

Results

We present Agalma, an automated tool that constructs matrices for phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma is built on the BioLite bioinformatics framework, which tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, installs dependencies, logs version numbers and calls to external programs, and enables rich HTML reports for all stages of the analysis. Agalma includes a small test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at https://bitbucket.org/caseywdunn/agalma.

Conclusions

Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended. Agalma also facilitates methods development by providing a complete modular workflow, bundled with test data, that will allow further optimization of each step in the context of a full phylogenomic analysis.
