BgeeDB, an R package for retrieval of curated expression datasets and for gene list expression localization enrichment tests

BgeeDB is a collection of functions for importing into R the re-annotated, quality-controlled and re-processed expression data available in the Bgee database. These data come from thousands of wild-type healthy samples of multiple animal species, generated with different gene expression technologies (RNA-seq, Affymetrix microarrays, expressed sequence tags, and in situ hybridizations). BgeeDB facilitates downstream analyses, such as gene expression analyses with other Bioconductor packages. Moreover, BgeeDB includes a new gene set enrichment test for preferred localization of expression of genes in anatomical structures (“TopAnat”). Used alongside the classical Gene Ontology enrichment test, TopAnat provides a complementary way to interpret gene lists. Availability: https://www.bioconductor.org/packages/BgeeDB/
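To make the idea of an expression localization enrichment test concrete, the sketch below computes a one-sided hypergeometric p-value for a single anatomical structure. It is an illustration under stated assumptions, not the TopAnat implementation: the abstract does not specify the statistic TopAnat uses, the function name `hypergeom_enrichment_pvalue` is hypothetical, the counts are a made-up toy example, and Python is used here only to keep the sketch self-contained (BgeeDB itself is an R package).

```python
from math import comb


def hypergeom_enrichment_pvalue(universe_size, expressed_in_universe,
                                list_size, expressed_in_list):
    """One-sided over-representation p-value for one anatomical structure.

    Returns P(X >= expressed_in_list), where X counts how many genes of a
    random list of `list_size` genes, drawn without replacement from a
    universe of `universe_size` genes, are among the `expressed_in_universe`
    genes expressed in the structure.
    """
    # X cannot exceed either the list size or the number of expressed genes.
    max_hits = min(expressed_in_universe, list_size)
    not_expressed = universe_size - expressed_in_universe
    # Count the gene lists with at least `expressed_in_list` hits.
    tail = sum(
        comb(expressed_in_universe, k) * comb(not_expressed, list_size - k)
        for k in range(expressed_in_list, max_hits + 1)
    )
    return tail / comb(universe_size, list_size)


# Toy numbers (assumed, not from Bgee): 20 genes in the universe, 5 of them
# expressed in the structure; a list of 4 genes contains 3 expressed ones.
p_value = hypergeom_enrichment_pvalue(20, 5, 4, 3)
print(f"p = {p_value:.4f}")
```

In a real analysis one such test would be run per anatomical structure, so the resulting p-values would need a multiple-testing correction before interpretation.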
