GToTree: a user-friendly workflow for phylogenomics

Summary: Genome-level evolutionary inference (i.e., phylogenomics) is becoming an increasingly essential step in many biologists’ work, for example in characterizing newly recovered genomes or in leveraging available reference genomes to guide evolutionary questions. Accordingly, several tools are available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the required computational work (accessing genomic data on large scales, integrating genomes from different file formats, performing the required filtering, stitching different tools together, etc.) can be prohibitive. Here I introduce GToTree, a command-line tool that takes any combination of fasta files, GenBank files, and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on the specified single-copy gene (SCG) set. While GToTree can work with any custom hidden Markov models (HMMs), it also includes 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built from searches of ~12,000 high-quality bacterial and archaeal genomes. GToTree aims to give more researchers the capability to make phylogenomic trees.

Availability: GToTree is open-source and freely available for download from github.com/AstrobioMike/GToTree

Documentation: github.com/AstrobioMike/GToTree/wiki

Implementation: GToTree is implemented primarily in bash, with helper scripts written in Python.

Contact: Mike.Lee@nasa.gov
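To make the completeness and redundancy estimates mentioned in the Summary concrete, here is a minimal Python sketch of one common way such values are derived from single-copy gene hit counts. The function name, the input layout, and the exact definitions (completeness as the percentage of target SCGs found at least once, redundancy as the percentage found more than once) are assumptions made for illustration; they are not taken from GToTree's source, and its actual calculations may differ.

```python
from typing import Dict, Iterable, Tuple


def completeness_and_redundancy(
    hit_counts: Dict[str, int], target_genes: Iterable[str]
) -> Tuple[float, float]:
    """Return (completeness %, redundancy %) for one genome.

    hit_counts maps a target gene name to the number of hits found in the
    genome (hypothetical input layout); genes absent from the dict count as 0.
    """
    targets = list(target_genes)
    if not targets:
        raise ValueError("target_genes must not be empty")

    # Assumed definition: a gene contributes to completeness if seen at least once.
    found = sum(1 for gene in targets if hit_counts.get(gene, 0) >= 1)
    # Assumed definition: a gene contributes to redundancy if seen more than once.
    multi = sum(1 for gene in targets if hit_counts.get(gene, 0) > 1)

    return 100 * found / len(targets), 100 * multi / len(targets)


# Toy example with four target genes; the counts are made up for illustration.
targets = ["rpoB", "recA", "gyrB", "rplB"]
hits = {"rpoB": 1, "recA": 2, "gyrB": 0}
print(completeness_and_redundancy(hits, targets))  # (50.0, 25.0)
```

Under these assumed definitions, a genome with low completeness or high redundancy would be a natural candidate for filtering before alignment and tree building.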
