The Dawn of Open Access to Phylogenetic Data

The scientific enterprise depends critically on the preservation of, and open access to, published data. This basic tenet applies acutely to phylogenies (estimates of evolutionary relationships among species). Phylogenies are increasingly estimated from large, genome-scale datasets using complex statistical methods that demand growing levels of expertise and computational investment. Moreover, the resulting phylogenetic data provide an explicit historical perspective that critically informs research in a vast and growing number of scientific disciplines. One such use is the study of changes in rates of lineage diversification (speciation minus extinction) through time. As part of a meta-analysis in this area, we sought to collect phylogenetic data (comprising nucleotide sequence alignment and tree files) from 217 studies published in 46 journals over a 13-year period. We document our attempts to procure those data (from online archives and by direct request to corresponding authors), and report results of analyses (using Bayesian logistic regression) to assess the impact of various factors on the success of our efforts. Overall, complete phylogenetic data for many of these studies are effectively lost to science. Our study indicates that phylogenetic data are more likely to be deposited in online archives and/or shared upon request when: (1) the publishing journal has a strong data-sharing policy; (2) the publishing journal has a higher impact factor; and (3) the data are requested from faculty rather than students. Importantly, our survey spans recent policy initiatives and infrastructural changes; our analyses indicate that the positive impact of these community initiatives has been both dramatic and immediate. Although the results of our study indicate that the situation is dire, our findings also reveal tremendous recent progress in the sharing and preservation of phylogenetic data.
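To make the analysis mentioned above more concrete, the following is a minimal sketch of a Bayesian logistic regression of data-sharing success on study-level predictors. It is an illustration only, not the authors' actual model or sampler: the data are simulated, the two binary predictors (a strong journal policy, a faculty recipient), the effect sizes, the Normal(0, 10^2) priors, and the random-walk Metropolis sampler are all assumptions chosen for brevity, and every function name here is hypothetical.

```python
import math
import random


def simulate_studies(n, seed=0):
    """Simulate hypothetical studies: columns are intercept, strong_policy, is_faculty."""
    rng = random.Random(seed)
    true_beta = [-1.0, 1.5, 1.0]  # assumed effects, for illustration only
    X, y = [], []
    for _ in range(n):
        row = [1.0, float(rng.random() < 0.5), float(rng.random() < 0.5)]
        eta = sum(b * x for b, x in zip(true_beta, row))
        p_shared = 1.0 / (1.0 + math.exp(-eta))
        X.append(row)
        y.append(1 if rng.random() < p_shared else 0)  # 1 = data obtained
    return X, y


def log_posterior(beta, X, y, prior_sd=10.0):
    """Unnormalized log posterior with independent Normal(0, prior_sd^2) priors."""
    lp = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        # Bernoulli log-likelihood y*eta - log(1 + exp(eta)), computed stably.
        lp += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp


def metropolis(X, y, n_iter=3000, step=0.15, seed=1):
    """Random-walk Metropolis over the coefficient vector; returns all draws."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y)
    draws = []
    for _ in range(n_iter):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        draws.append(beta)
    return draws


def posterior_mean(draws, burn_in=1000):
    """Average each coefficient over the post-burn-in draws."""
    kept = draws[burn_in:]
    return [sum(d[j] for d in kept) / len(kept) for j in range(len(kept[0]))]


if __name__ == "__main__":
    X, y = simulate_studies(200, seed=7)
    draws = metropolis(X, y)
    names = ["intercept", "strong_policy", "is_faculty"]
    for name, value in zip(names, posterior_mean(draws)):
        print(f"{name}: posterior mean log-odds = {value:.2f}")
```

A positive posterior mean for a coefficient corresponds to higher odds of obtaining the data when that factor is present. A real analysis would also check convergence across multiple chains before interpreting the estimates.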
