ReproPhylo: An Environment for Reproducible Phylogenomics

The reproducibility of experiments is key to the scientific process, and particularly necessary for accurate reporting of analyses in data-rich fields such as phylogenomics. We present ReproPhylo, a phylogenomic analysis environment developed to ensure experimental reproducibility, to facilitate the handling of large-scale data, and to assist methodological experimentation. Reproducibility and instantaneous repeatability are built into the ReproPhylo system and require no user intervention or configuration, because the system stores the experimental workflow as a single, serialized Python object containing explicit provenance and environment information. This ‘single file’ approach ensures the persistence of provenance across iterations of the analysis, with changes automatically managed by the version control program Git. This file and a Git repository are the primary reproducibility outputs of the program. In addition, ReproPhylo produces an extensive human-readable report and generates a comprehensive experimental archive file, both of which are suitable for submission with publications. The system facilitates thorough experimental exploration of both parameters and data. ReproPhylo is a platform-independent CC0 Python module. It is easily installed as a Docker image or as a self-sufficient WinPython package, both with a Jupyter Notebook GUI, or as a slimmer version in a Galaxy distribution.
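The ‘single file’ idea can be illustrated with a minimal sketch. It is not ReproPhylo's actual API: the dictionary layout and the function names `new_workflow`, `record_step`, `save`, `load`, and `commit` are hypothetical, chosen only to show a workflow object that carries its own provenance and environment information, is serialized to one file, and is then handed to Git.

```python
import pickle
import platform
import subprocess
import sys
from pathlib import Path


def new_workflow():
    # Hypothetical layout: steps hold provenance, environment describes the host.
    # A real system would also record the versions of the external tools it calls.
    return {
        "steps": [],
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


def record_step(workflow, name, **params):
    # Keep every step and its parameters inside the object itself,
    # so the provenance travels with the serialized file.
    workflow["steps"].append({"step": name, "params": params})


def save(workflow, path):
    # Serialize the whole workflow, provenance included, into a single file.
    Path(path).write_bytes(pickle.dumps(workflow))


def load(path):
    # Restore the workflow exactly as it was saved.
    return pickle.loads(Path(path).read_bytes())


def commit(path, message):
    # Assumes the file lives in an initialized Git repository
    # with a configured user identity; raises if Git reports an error.
    target = Path(path).resolve()
    repo = str(target.parent)
    subprocess.run(["git", "-C", repo, "add", target.name], check=True)
    subprocess.run(["git", "-C", repo, "commit", "-m", message], check=True)
```

In this sketch, calling `record_step(workflow, "align", program="mafft")` followed by `save` and `commit` after each analysis step would leave one file whose history in Git mirrors the iterations of the experiment.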
