SnoVault and encodeD: A novel object-based storage system and applications to ENCODE metadata

The Encyclopedia of DNA Elements (ENCODE) project is an ongoing collaborative effort, initiated shortly after the completion of the Human Genome Project, to create a comprehensive catalog of functional elements. The current database exceeds 6500 experiments across more than 450 cell lines and tissues, using a wide array of experimental techniques to study the chromatin structure and the regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, identifying and organizing experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general-purpose software system, known as SnoVault, that supports metadata and file submission, a database for metadata storage, web pages for displaying the metadata, and a robust API for querying the metadata. The software is fully open-source; code and installation instructions can be found at http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ (for storing genomic data in the manner of ENCODE). The core database engine, SnoVault, which is completely independent of ENCODE, genomic data, or bioinformatic data, has been released as a separate Python package.
