Restructured GEO: restructuring Gene Expression Omnibus metadata for genome dynamics analysis

Abstract

Motivation: Gene Expression Omnibus (GEO) and other publicly available data sources store their metadata as unstructured English text, which makes automated reuse very difficult.

Results: We employed text mining techniques to analyze GEO metadata and developed the Restructured GEO database (ReGEO). ReGEO reorganizes and categorizes GEO series and makes them searchable by two new attributes extracted automatically from each series’ metadata: the number of time points tested in the experiment and the disease being investigated. ReGEO also makes series searchable by other attributes available in GEO, such as platform organism, experiment type, and associated PubMed ID, as well as by general keywords in the study’s description. Our approach greatly expands the usability of GEO data and demonstrates a credible way to improve the utility of the vast amount of publicly available data in the era of Big Data research.
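
To make the idea of a "number of time points" attribute concrete, the following is a minimal sketch of how such a value might be pulled from free-text series metadata. It is not the ReGEO pipeline, which the abstract does not describe in detail. The regular expression, the unit table, and the function name `count_time_points` are all hypothetical choices made for illustration, and the sketch deliberately ignores phrasings such as "day 3".

```python
import re

# Hypothetical building blocks for this illustration only.
NUM = r"\d+(?:\.\d+)?"
UNIT = r"(?:hours?|hrs?|h|days?|d|minutes?|mins?|weeks?|wks?)"

# Matches a single value ("48 hours") or a list sharing one unit
# ("0, 6, 12 and 24 h"). Group 1 holds the numbers, group 2 the unit.
SERIES = re.compile(
    rf"(?<![\w.])((?:{NUM}(?:\s*,\s*|\s*,?\s+and\s+))*{NUM})\s*-?\s*({UNIT})\b",
    re.IGNORECASE,
)

# Conversion to hours, keyed by the first letter of the matched unit.
HOURS_PER_UNIT = {"h": 1.0, "d": 24.0, "m": 1.0 / 60.0, "w": 168.0}


def count_time_points(text):
    """Return the number of distinct time points mentioned in `text`.

    Values are normalized to hours so that "2 days" and "48 hours"
    count as the same time point.
    """
    points = set()
    for match in SERIES.finditer(text):
        numbers, unit = match.group(1), match.group(2)
        factor = HOURS_PER_UNIT[unit[0].lower()]
        for value in re.findall(NUM, numbers):
            points.add(round(float(value) * factor, 6))
    return len(points)


if __name__ == "__main__":
    summary = "Samples were collected at 0, 6, 12 and 24 h after infection."
    print(count_time_points(summary))
```

A rule-based pass like this is easy to inspect but brittle, so in practice its output would need to be checked against manually curated series before being used as a search attribute.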
