ParGenes: a tool for massively parallel model selection and phylogenetic tree inference on thousands of genes

Abstract

Motivation: Coalescent- and reconciliation-based methods are now widely used to infer species phylogenies from genomic data. They typically take per-gene phylogenies as input, which requires running many individual tree inferences on a large set of multiple sequence alignments (MSAs). At present, no easy-to-use parallel tool for this task exists. Ad hoc scripts written for this purpose not only incur additional implementation overhead but can also lead to poor resource utilization and long times-to-solution. We present ParGenes, a tool for simultaneously determining the best-fit model and inferring maximum likelihood (ML) phylogenies on thousands of independent MSAs using supercomputers.

Results: ParGenes executes common phylogenetic pipeline steps such as model testing, ML inference(s), bootstrapping and computation of branch support values via a single parallel program invocation. We evaluated ParGenes by inferring > 20 000 phylogenetic gene trees with bootstrap support values from Ensembl Compara and VectorBase alignments in 28 h on a cluster with 1024 nodes.

Availability and implementation: ParGenes is available under the GNU GPL at https://github.com/BenoitMorel/ParGenes.

Supplementary information: Supplementary material is available at Bioinformatics online.
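To illustrate why scheduling matters for resource utilization when thousands of independent per-gene jobs of very different sizes share one allocation, the following is a minimal sketch of a greedy longest-job-first assignment. It is not taken from ParGenes and does not describe its actual scheduler; the function name `schedule_jobs`, the file names, and the cost values are all hypothetical, and the per-job cost is assumed to be some externally supplied estimate (for example, one derived from alignment size).

```python
import heapq
from typing import Dict, List, Tuple


def schedule_jobs(costs: Dict[str, float], n_workers: int) -> List[List[str]]:
    """Assign independent jobs to workers, largest estimated cost first.

    Hypothetical helper for illustration only: each job goes to the worker
    with the smallest accumulated load so far (greedy LPT heuristic).
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")

    # Min-heap of (accumulated load, worker index).
    heap: List[Tuple[float, int]] = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    plan: List[List[str]] = [[] for _ in range(n_workers)]

    # Sort by descending cost; break ties by name so the result is deterministic.
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        plan[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))
    return plan


# Hypothetical per-gene cost estimates (arbitrary units).
example_costs = {
    "gene_a.fasta": 8.0,
    "gene_b.fasta": 5.0,
    "gene_c.fasta": 4.0,
    "gene_d.fasta": 3.0,
}
example_plan = schedule_jobs(example_costs, n_workers=2)
```

With these assumed costs, the two workers end up with loads of 11 and 9 units. Handing out the largest jobs first is a common way to keep a single very large alignment from being scheduled last and leaving the other workers idle while it finishes.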
