Prevention of data duplication for high throughput sequencing repositories

Abstract Prevention of unintended duplication is one of the ongoing challenges many databases have to address. Working with high-throughput sequencing data, the complexity of that challenge increases with the complexity of the definition of a duplicate. In a computational data model, a data object represents a real entity like a reagent or a biosample. This representation is similar to how a card represents a book in a paper library catalog. Duplicated data objects not only waste storage, they can mislead users into assuming the model represents more than the single entity. Even if it is clear that two objects represent a single entity, data duplication opens the door to potential inconsistencies between the objects since the content of the duplicated objects can be updated independently, allowing divergence of the metadata associated with the objects. Analogously to a situation in which a catalog in a paper library would contain by mistake two cards for a single copy of a book. If these cards are listing simultaneously two different individuals as current book borrowers, it would be difficult to determine which borrower (out of the two listed) actually has the book. Unfortunately, in a large database with multiple submitters, unintended duplication is to be expected. In this article, we present three principal guidelines the Encyclopedia of DNA Elements (ENCODE) Portal follows in order to prevent unintended duplication of both actual files and data objects: definition of identifiable data objects (I), object uniqueness validation (II) and de-duplication mechanism (III). In addition to explaining our modus operandi, we elaborate on the methods used for identification of sequencing data files. Comparison of the approach taken by the ENCODE Portal vs other widely used biological data repositories is provided. Database URL: https://www.encodeproject.org/

[1]  Piero Carninci,et al.  On-the-fly selection of cell-specific enhancers, genes, miRNAs and proteins across the human body using SlideBase , 2016, Database J. Biol. Databases Curation.

[2]  Erik Schultes,et al.  The FAIR Guiding Principles for scientific data management and stewardship , 2016, Scientific Data.

[3]  Dennis B. Troup,et al.  NCBI GEO: archive for functional genomics data sets—10 years on , 2010, Nucleic Acids Res..

[4]  Sean R. Davis,et al.  NCBI GEO: archive for functional genomics data sets—update , 2012, Nucleic Acids Res..

[5]  J. Michael Cherry,et al.  Principles of metadata organization at the ENCODE data coordination center , 2016, Database J. Biol. Databases Curation.

[6]  L. Staudt,et al.  The NCI Genomic Data Commons as an engine for precision medicine. , 2017, Blood.

[7]  Sergio Contrino,et al.  The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details , 2011, Database J. Biol. Databases Curation.

[8]  S. Linnarsson,et al.  Counting absolute numbers of molecules using unique molecular identifiers , 2011, Nature Methods.

[9]  T. Mikkelsen,et al.  The NIH Roadmap Epigenomics Mapping Consortium , 2010, Nature Biotechnology.

[10]  Chris Morris,et al.  Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data , 2017, bioRxiv.

[11]  M. Uhlén,et al.  Antibodypedia, a Portal for Sharing Antibody and Antigen Validation Data*S , 2008, Molecular & Cellular Proteomics.

[12]  J. Michael Cherry,et al.  ENCODE data at the ENCODE portal , 2015, Nucleic Acids Res..