Accelerating data-driven discovery with scientific asset management

The overhead of managing data in complex discovery processes, whose experimental protocols involve numerous data-producing and computational steps, has become the gating factor that determines the pace of discovery. The lack of comprehensive systems to capture, manage, organize, and retrieve data throughout the discovery life cycle imposes significant demands on scientists' time and effort and leads to reduced productivity, poor reproducibility, and an absence of data sharing. In “creative fields” like digital photography and music, digital asset management (DAM) systems for capturing, managing, curating, and consuming digital assets such as photos and audio recordings have fundamentally transformed how these data are used. While asset management has not taken hold in eScience applications, we believe that a transformation similar to that observed in the creative space could be achieved in scientific domains if appropriate ecosystems of asset management tools existed to capture, manage, and curate data throughout the scientific discovery process. In this paper, we introduce DERIVA, a framework and infrastructure for asset management in eScience, and present initial results from its usage in active research use cases.
