The Road Towards Reproducibility in Science: The Case of Data Citation

Data citation has a profound impact on the reproducibility of science, a hot topic in many disciplines such as astronomy, biology, physics, and computer science. Recently, several authoritative journals (e.g., Nature Scientific Data and Nature Physics) have begun requesting that authors share their data and provide validation methodologies for their experiments; these publications, and the publishing industry in general, see data citation as a way to provide new, reliable, and usable means for sharing and referring to scientific data. In this paper, we present the state of the art of data citation and discuss open issues and research directions, with a specific focus on reproducibility. Furthermore, we investigate reproducibility issues using experimental evaluation in Information Retrieval (IR) as a test case. (This paper is a revised and extended version of [33, 35, 57].)

[1] Peter Buneman, et al. A Rule-Based Citation System for Structured and Evolving Datasets, 2010, IEEE Data Eng. Bull.

[2] Nicola Ferro, et al. Managing the Knowledge Creation Process of Large-Scale Evaluation Campaigns, 2009, ECDL.

[3] Elaine G. Toms, et al. Information Access Evaluation. Multilinguality, Multimodality, and Interaction, 2014.

[4] Paul Buitelaar, et al. Semantic representation and enrichment of information retrieval experimental data, 2017, International Journal on Digital Libraries.

[5] Andreas Rauber, et al. Scalable data citation in dynamic, large databases: Model and reference implementation, 2013, 2013 IEEE International Conference on Big Data.

[6] Nicola Ferro, et al. Rank-Biased Precision Reloaded: Reproducibility and Generalization, 2015, ECIR.

[7] David De Roure, et al. The future of scholarly communications, 2014.

[8] Allan Hanbury, et al. An Information Retrieval Ontology for Information Retrieval Nanopublications, 2014, CLEF.

[9] Daniel Deutch, et al. A Model for Fine-Grained Data Citation, 2017, CIDR.

[10] Maarten Hoogerwerf, et al. Enhanced Publications: Linking Publications and Research Data in Digital Repositories, 2009.

[11] Yi-Hung Huang, et al. Citing a Data Repository: A Case Study of the Protein Data Bank, 2015, PLoS ONE.

[12] Nicola Ferro, et al. Reproducibility Challenges in Information Retrieval Evaluation, 2017, ACM J. Data Inf. Qual.

[13] Giuseppe Santucci, et al. Measuring and Analyzing the Scholarly Impact of Experimental Evaluation Initiatives, 2014, IRCDL.

[14] K. Baggerly. Disclose all data in publications, 2010, Nature.

[15] Tony Hey, et al. The Fourth Paradigm: Data-Intensive Scientific Discovery, 2009.

[16] Keishi Tajima, et al. Archiving scientific data, 2004, TODS.

[17] Gabriella Kazai, et al. Advances in Information Retrieval, 2015, Lecture Notes in Computer Science.

[18] Yvonne M. Socha, et al. Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013.

[19] Fabian Steeg, et al. Information-Retrieval: Evaluation, 2010.

[20] Noriko Kando, et al. Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science", 2016, SIGIR Forum.

[21] Paul Clough, et al. Information Access Evaluation. Multilinguality, Multimodality, and Interaction, 2014, Lecture Notes in Computer Science.

[22] Jimmy J. Lin, et al. Evaluation-as-a-Service: Overview and Outlook, 2015, ArXiv.

[23] Evaristo Jiménez-Contreras, et al. Analyzing data citation practices using the data citation index, 2015, J. Assoc. Inf. Sci. Technol.

[24] Gianmaria Silvello. A Methodology for Citing Linked Open Data Subsets, 2015, D-Lib Mag.

[25] Christine L. Borgman, et al. The conundrum of sharing research data, 2012, J. Assoc. Inf. Sci. Technol.

[26] Nicola Ferro, et al. Towards Open-Source Shared Implementations of Keyword-Based Access Systems to Relational Data, 2017, EDBT/ICDT Workshops.

[27] Evaristo Jiménez-Contreras, et al. Analyzing data citation practices according to the Data Citation Index, 2015, ArXiv.

[28] James Frew, et al. Why data citation is a computational problem, 2016, Commun. ACM.

[29] Nicola Ferro, et al. Keyword-based access to relational data: To reproduce, or to not reproduce?, 2017, SEBD.

[30] Alistair Moffat, et al. EvaluatIR: an online tool for evaluating and comparing IR systems, 2009, SIGIR.

[31] Paul T. Groth, et al. The anatomy of a nanopublication, 2010, Inf. Serv. Use.

[32] Andreas Rauber, et al. Asking the Right Questions - Query-Based Data Citation to Precisely Identify Subsets of Data, 2015, ERCIM News.

[33] Paolo Manghi, et al. On Bridging Data Centers and Publishers: The Data-Literature Interlinking Service, 2015, MTSR.

[34] Omar Alonso, et al. Using crowdsourcing for TREC relevance assessment, 2012, Inf. Process. Manag.

[35] Daniel Deutch, et al. Data Citation: A Computational Challenge, 2017, PODS.

[36] Gianmaria Silvello, et al. Learning to cite framework: How to automatically construct citations for hierarchical data, 2017, J. Assoc. Inf. Sci. Technol.

[37] D. Carr, et al. Sharing Research Data to Improve Public Health, 2015, Journal of Empirical Research on Human Research Ethics (JERHRE).

[38] Juliana Freire, et al. Reproducibility of Data-Oriented Experiments in e-Science (Dagstuhl Seminar 16041), 2016, Dagstuhl Reports.

[39] Paolo Manghi, et al. Data journals: A survey, 2014, J. Assoc. Inf. Sci. Technol.

[40] Elaine Toms, et al. CLEF 2014, 2014, SIGF.

[41] Micah Altman, et al. A Proposed Standard for the Scholarly Citation of Quantitative Data, 2007, IASSIST Conference.

[42] Sarah Callaghan, et al. Joint declaration of data citation principles, 2014.

[43] Ellen M. Voorhees, et al. Promoting Repeatability Through Open Runs, 2016, EVIA@NTCIR.

[44] Christine L Borgman, et al. Why are the attribution and citation of scientific data important? In: Uhlir, Paul and Cohen, Daniel (eds.). Report from Developing Data Attribution and Citation Practices and Standards: An International Symposium and Workshop, 2012.

[45] Nicola Ferro, et al. "Data Citation is Coming". Introduction to the Special Issue on Data Citation, 2016, Bull. IEEE Tech. Comm. Digit. Libr.

[46] Giorgio Maria Di Nunzio, et al. The Importance of Scientific Data Curation for Evaluation Campaigns, 2007, DELOS.

[47] Jens Klump, et al. DOI for geoscience data - how early practices shape present perceptions, 2016, Earth Science Informatics.

[48] Paolo Manghi, et al. A Framework Supporting the Shift from Traditional Digital Publications to Enhanced Publications, 2015, D-Lib Mag.

[49] Andrew Trotman, et al. Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR), 2016, SIGF.

[50] Nicola Ferro, et al. DIRECTions: Design and Specification of an IR Evaluation Infrastructure, 2012, CLEF.

[51] Natasha Simons, et al. Implementing DOIs for Research Data, 2012, D-Lib Mag.

[52] Vassilis Christophides, et al. High-level change detection in RDF(S) KBs, 2013, TODS.

[53] Benno Stein, et al. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling, 2014, CLEF.

[54] Mercè Crosas, et al. The Evolution of Data Citation: From Principles to Implementation, 2014.

[55] Philippe Bonnet, et al. Computational reproducibility: state-of-the-art, challenges, and database research opportunities, 2012, SIGMOD Conference.

[56] Ellen M. Voorhees, et al. Variations in relevance judgments and the measurement of retrieval effectiveness, 1998, SIGIR '98.

[57] Julio Gonzalo, et al. Overview of RepLab 2012: Evaluating Online Reputation Management Systems, 2012, CLEF.

[58] Nicola Ferro, et al. Towards an infrastructure for digital library performance evaluation, 2009.