DataFed: Towards Reproducible Research via Federated Data Management

The increasingly collaborative, globalized nature of scientific research, combined with the need to share data and the explosion in data volumes, creates an urgent need for a scientific data management system (SDMS). An SDMS presents a logical and holistic view of data that greatly simplifies and empowers data organization, curation, searching, sharing, dissemination, and related tasks. We present DataFed, a lightweight, distributed SDMS that spans a federation of storage systems within a loosely coupled network of scientific facilities. Unlike existing SDMS offerings, DataFed uses high-performance, scalable user management and data transfer technologies that simplify its deployment, maintenance, and expansion. DataFed provides web-based and command-line interfaces to manage data and integrate with complex scientific workflows. DataFed represents a step towards reproducible scientific research by enabling reliable staging of the correct data in the desired environment.
