Gene Set to Diseases (GS2D): disease enrichment analysis on human gene sets with literature data

Large sets of candidate genes derived from high-throughput biological experiments can be characterized by functional enrichment analysis. The analysis compares the functions annotated to a gene set of interest against those of a background gene set; functions associated with a statistically significant number of genes in the set are then expected to be relevant. Web tools offering disease enrichment analysis on gene sets are often based on gene-disease associations from manually curated or experimental data, which are accurate but do not cover all diseases discussed in the literature. Using associations automatically derived from literature data could be a cost-effective way to improve disease coverage for enrichment analysis at comparable levels of accuracy. We have implemented a method named Gene Set to Diseases (GS2D) as a web tool that performs disease enrichment analysis on sets of human protein-coding genes. It uses an automatically built dataset of more than 63 thousand gene-disease associations, defined as statistically significant co-occurrences of genes and diseases in annotations of biomedical citations from PubMed. The dataset covers more diseases for enrichment analysis than the largest comparable curated database (Comparative Toxicogenomics Database), and its performance compared favourably to similar approaches based on manually curated or experimental data. Graphical and programmatic interfaces are available at http://cbdm.uni-mainz.de/geneset2diseases.
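To make the idea of disease enrichment against a background concrete, the following is a minimal sketch rather than the GS2D implementation. The abstract does not name the statistical test, so a one-sided hypergeometric test with Benjamini-Hochberg correction is assumed here as a common choice for this kind of analysis. The function names, gene identifiers, and disease labels are hypothetical and serve only as toy data.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K annotated, n drawn)."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def disease_enrichment(gene_set, background, associations):
    """Test each disease for over-representation in gene_set.

    associations maps a disease name to the set of genes linked to it
    (assumed format, not the GS2D data model).
    Returns a list of (disease, overlap, p_value, adjusted_p), sorted by p_value.
    """
    background = set(background)
    genes = set(gene_set) & background
    rows = []
    for disease, linked in associations.items():
        linked = set(linked) & background
        overlap = len(genes & linked)
        p = hypergeom_sf(overlap, len(background), len(linked), len(genes))
        rows.append((disease, overlap, p))
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    results = [(d, k, p, q) for (d, k, p), q in zip(rows, adjusted)]
    return sorted(results, key=lambda row: row[2])


if __name__ == "__main__":
    # Toy data: 20 hypothetical background genes and two hypothetical diseases.
    background = [f"g{i}" for i in range(20)]
    associations = {
        "disease_A": {"g0", "g1", "g2", "g3"},
        "disease_B": {f"g{i}" for i in range(4, 14)},
    }
    query = {"g0", "g1", "g2", "g4"}
    for disease, overlap, p, q in disease_enrichment(query, background, associations):
        print(f"{disease}\toverlap={overlap}\tp={p:.4g}\tadjusted_p={q:.4g}")
```

In this sketch the background is every gene that could have been selected, so the choice of background directly changes the p-values; a real analysis would use the full set of annotated human protein-coding genes or the genes measurable in the experiment.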
