Discovery of perturbation gene targets via free text metadata mining in Gene Expression Omnibus

NCBI’s Gene Expression Omnibus (GEO) database holds over 2.5 million publicly available gene expression samples across 101,000 data series. Because GEO’s free-text metadata does not use standardised ontology terms to annotate experiment type and sample type, the database remains difficult to harness computationally without significant manual intervention. In this work, we present GEOracle, an interactive R/Shiny tool that uses text mining and machine learning techniques to automatically identify perturbation experiments, group treatment and control samples, and perform differential expression analysis. We present applications of GEOracle to discover conserved signalling pathway target genes and to identify an organ-specific gene regulatory network. GEOracle is effective in discovering perturbation gene targets in GEO by harnessing its free-text metadata. Its effectiveness and applicability have been demonstrated by cross-validation and two real-life case studies. It opens up new avenues to unlock the gene regulatory information embedded inside large biological databases such as GEO. GEOracle is available at https://github.com/VCCRI/GEOracle.
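To make the idea of grouping treatment and control samples from free-text metadata concrete, the following is a minimal sketch in Python. It is not GEOracle's implementation (GEOracle is an R/Shiny tool that also uses machine learning); it is a simple keyword heuristic. The keyword lists, the function names `label_sample` and `group_samples`, and the sample titles are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical keyword patterns, NOT taken from GEOracle.
# Control patterns are checked first so that titles such as
# "control siRNA" are not mislabelled as perturbations.
CONTROL_PATTERNS = [
    r"\bcontrol\b", r"\bwild[\s-]?type\b", r"\bwt\b",
    r"\bvehicle\b", r"\buntreated\b", r"\bmock\b", r"\bscrambled?\b",
]
PERTURBATION_PATTERNS = [
    r"\bknock[\s-]?(?:out|down)\b", r"\bko\b", r"\bsirna\b",
    r"\bshrna\b", r"\boverexpress\w*", r"\btreated\b", r"\bmutant\b",
]


def label_sample(title):
    """Label one free-text sample title as control, perturbation or unknown."""
    text = title.lower()
    if any(re.search(p, text) for p in CONTROL_PATTERNS):
        return "control"
    if any(re.search(p, text) for p in PERTURBATION_PATTERNS):
        return "perturbation"
    return "unknown"


def group_samples(samples):
    """Group sample IDs by the label inferred from their titles.

    `samples` maps a sample ID to its free-text title.
    """
    groups = {"control": [], "perturbation": [], "unknown": []}
    for sample_id, title in samples.items():
        groups[label_sample(title)].append(sample_id)
    return groups


if __name__ == "__main__":
    # Made-up titles in the style of GEO sample annotations.
    example = {
        "sample_1": "Wild type heart rep1",
        "sample_2": "Gata4 knockout heart rep1",
        "sample_3": "Untreated cells",
        "sample_4": "TGFb treated 24h",
        "sample_5": "heart biopsy",
    }
    for label, ids in group_samples(example).items():
        print(label, ids)
```

A heuristic like this fails on ambiguous or unconventional titles, which illustrates why free-text metadata usually needs more than keyword rules, and why a tool in this space benefits from a manual review step.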
