Content-based comparison for collections identification

Assigning globally unique persistent identifiers (GUPIs) to datasets is intended to improve their accessibility and simplify how they are referenced and reused. However, as repositories receive more, and more complex, data, attesting to the identity of datasets attached to persistent identifiers over time is becoming more challenging. This is due to the nature of scientific research data, which is generated through distributed research practices and evolves across different computational environments. This work presents a robust, automated computational service for data content comparison as a valuable addition to assigning, managing, and tracking persistent identifiers. We operationalized the functions of the service within the archival space by linking data provenance and identity to authenticity. The need for such a service is shown through three genomics data use cases in which the results aided curators in establishing the identity of datasets and inferring issues of provenance. We describe the system's design, implementation, and performance, and report on lessons learned.
