BioDataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology

Abstract The biotechnology revolution is generating omics data at an exponentially growing pace. Biological data mining therefore demands automatic, ‘high-quality’ curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data that aims to promote and accelerate the reuse of public data. We applied the same preprocessing pipeline within each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce datasets ready for downstream analysis, and automatically annotated them with disease-ontology terms. We also flag datasets that share common samples and automatically discover control samples in case-control studies. BioDataome currently includes ∼5600 datasets and ∼260 000 samples spanning ∼500 diseases, and can be readily used in large-scale experiments and meta-analyses. All datasets are publicly available for querying and downloading via the BioDataome web application. We demonstrate BioDataome’s utility with exploratory data analysis examples. We have also developed the BioDataome R package, available at https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/

Kleanthi Lakiotaki | Ioannis Tsamardinos | Nikolaos Vorniotakis | Michail Tsagris | Georgios Georgakopoulos
