Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome

How easy is it to reproduce the results found in a typical computational biology paper? Either through experience or intuition, the reader will already know that the answer is “with difficulty” or “not at all.” In this paper we attempt to quantify this difficulty by reproducing a previously published paper for different classes of users (ranging from users with little expertise to domain experts) and suggest ways in which the situation might be improved. Quantification is achieved by estimating the time required to reproduce each of the steps in the method described in the original paper and to make them part of an explicit workflow that reproduces the original results. Reproducing the method took several months of effort, and required using new versions and new software that posed challenges to reconstructing and validating the results. The quantification leads to “reproducibility maps” that reveal that novice researchers would only be able to reproduce a few of the steps in the method, and that only expert researchers with advanced knowledge of the domain would be able to reproduce the method in its entirety. The workflow itself is published as an online resource together with supporting software and data. The paper concludes with a brief discussion of the complexities of requiring reproducibility in terms of cost versus benefit, and a set of desiderata with our observations and guidelines for improving reproducibility. This has implications not only for reproducing the work of others from published papers, but also for reproducing work from one’s own laboratory.
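To make the quantification scheme concrete, below is a minimal sketch, in Python, of how per-step time estimates might be aggregated into a “reproducibility map”: for each class of user, which workflow steps are within reach and at what estimated cost. This is not the authors’ actual code; the step names, user classes, and hour figures are hypothetical placeholders chosen only to illustrate the idea.

```python
# Minimal sketch (hypothetical, not the paper's code) of building a
# "reproducibility map" from per-step reproduction-time estimates.

from dataclasses import dataclass

@dataclass
class Step:
    name: str
    # Estimated hours to reproduce this step, per user class;
    # None marks a step effectively out of reach for that class.
    hours: dict  # e.g. {"novice": None, "bioinformatician": 40, "expert": 8}

# Placeholder steps and estimates, loosely modeled on a drugome-style pipeline.
STEPS = [
    Step("retrieve target protein structures", {"novice": 2, "bioinformatician": 1, "expert": 1}),
    Step("build homology models",              {"novice": None, "bioinformatician": 60, "expert": 20}),
    Step("compare drug binding sites",         {"novice": None, "bioinformatician": None, "expert": 30}),
    Step("dock approved drugs",                {"novice": None, "bioinformatician": None, "expert": 25}),
]

def reproducibility_map(steps, user_classes):
    """Return {user_class: (reproducible step names, total estimated hours)}."""
    result = {}
    for cls in user_classes:
        done = [s.name for s in steps if s.hours.get(cls) is not None]
        total = sum(s.hours[cls] for s in steps if s.hours.get(cls) is not None)
        result[cls] = (done, total)
    return result

if __name__ == "__main__":
    classes = ["novice", "bioinformatician", "expert"]
    for cls, (done, total) in reproducibility_map(STEPS, classes).items():
        print(f"{cls}: {len(done)}/{len(STEPS)} steps reproducible, ~{total} h")
```

Encoding out-of-reach steps as `None` rather than an arbitrarily large time estimate keeps the map honest about hard barriers, such as missing software or undocumented parameters, which is precisely the distinction the reproducibility maps in the paper are meant to surface.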
