BioDataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology

Abstract The biotechnology revolution generates omics data at an exponentially growing pace. Biological data mining therefore demands automatic, high-quality curation to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data that aims to promote and accelerate the reuse of public data. We followed the same preprocessing pipeline within each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce datasets that are ready for downstream analysis, and we automatically annotated them with Disease Ontology terms. We also flag datasets that share common samples and automatically identify control samples in case-control studies. BioDataome currently includes ∼5600 datasets and ∼260 000 samples spanning ∼500 diseases, and can readily be used in large-scale experiments and meta-analyses. All datasets are publicly available for querying and downloading via the BioDataome web application. We demonstrate BioDataome’s utility by presenting exploratory data analysis examples. We have also developed the BioDataome R package, available at https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/
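
The idea of flagging datasets that share common samples can be illustrated with a minimal sketch. This is not BioDataome's implementation: the function name `shared_samples`, the dictionary layout and the dataset and sample identifiers below are all hypothetical, and the sketch assumes that samples carry identifiers that can be compared directly across datasets.

```python
from itertools import combinations


def shared_samples(datasets):
    """Return, for each pair of datasets, the sample IDs they have in common.

    `datasets` maps a dataset ID to an iterable of sample IDs.
    Pairs with no overlap are omitted from the result.
    """
    overlaps = {}
    # Sort by dataset ID so every pair is reported in a stable order.
    for (id_a, samples_a), (id_b, samples_b) in combinations(sorted(datasets.items()), 2):
        common = set(samples_a) & set(samples_b)
        if common:
            overlaps[(id_a, id_b)] = common
    return overlaps


# Hypothetical identifiers, for illustration only.
toy = {
    "DS_A": ["S1", "S2", "S3"],
    "DS_B": ["S3", "S4"],
    "DS_C": ["S5"],
}
result = shared_samples(toy)
```

Comparing every pair of datasets is quadratic in the number of datasets; for thousands of datasets, an inverted index from sample ID to dataset IDs would likely scale better.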
