Decentralized provenance-aware publishing with nanopublications

Publication and archival of scientific results is still commonly considered the responsibility of classical publishing companies. Classical forms of publishing, however, which center around printed narrative articles, no longer seem well-suited in the digital age. In particular, there are currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this article, we propose to design scientific data publishing as a Web-based bottom-up process, without top-down control of central authorities such as publishing companies. Based on a novel combination of existing concepts and technologies, we present a server network that stores and archives data in a decentralized manner in the form of nanopublications, an RDF-based format to represent scientific data. We show how this approach allows researchers to publish, retrieve, verify, and recombine datasets of nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used as a low-level data publication layer to serve the Semantic Web in general. Our evaluation of the current network shows that this system is efficient and reliable.
