ReproPhylo: An Environment for Reproducible Phylogenomics

The reproducibility of experiments is key to the scientific process, and particularly necessary for accurate reporting of analyses in data-rich fields such as phylogenomics. We present ReproPhylo, a phylogenomic analysis environment developed to ensure experimental reproducibility, to facilitate the handling of large-scale data, and to assist methodological experimentation. Reproducibility and instantaneous repeatability are built into ReproPhylo and require no user intervention or configuration, because the system stores the experimental workflow as a single, serialized Python object containing explicit provenance and environment information. This ‘single file’ approach ensures the persistence of provenance across iterations of the analysis, with changes automatically managed by the version control program Git. ReproPhylo produces an extensive human-readable report and generates a comprehensive experimental archive file, both of which are suitable for submission with publications. The system facilitates thorough experimental exploration of both parameters and data. ReproPhylo is a platform-independent CC0 Python module and is easily installed as a Docker image with a Jupyter GUI, or as a slimmer version in a Galaxy distribution.
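The ‘single file’ idea can be illustrated with a minimal Python sketch. This is not ReproPhylo's API: the function names (`new_experiment`, `record_step`, `save_experiment`, `load_experiment`), the dictionary layout, and the example program and parameter names are all hypothetical, chosen only to show how a workflow, its parameters, and basic environment information might travel together in one serialized object.

```python
import os
import pickle
import platform
import tempfile


def new_experiment(name):
    """Create an empty workflow record (hypothetical layout, for illustration)."""
    return {"name": name, "steps": [], "environment": {}}


def record_step(experiment, program, **parameters):
    """Append one analysis step together with the parameters it was run with."""
    experiment["steps"].append({"program": program, "parameters": parameters})


def save_experiment(experiment, path):
    """Stamp the record with environment details and serialize it to one file."""
    experiment["environment"] = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    with open(path, "wb") as handle:
        pickle.dump(experiment, handle)


def load_experiment(path):
    """Restore the full workflow record, including its provenance."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Illustrative program and parameter names only.
    experiment = new_experiment("demo")
    record_step(experiment, "aligner", gap_open=1.5)
    record_step(experiment, "tree_builder", model="GTR")

    path = os.path.join(tempfile.mkdtemp(), "experiment.pkl")
    save_experiment(experiment, path)

    restored = load_experiment(path)
    for step in restored["steps"]:
        print(step["program"], step["parameters"])
    print(restored["environment"])
```

In a setup like this, committing the serialized file to a Git repository after each save would give a history of the analysis in the spirit the abstract describes, although how ReproPhylo itself drives Git is not specified here.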
