Functional Cohesion of Gene Sets Determined by Latent Semantic Indexing of PubMed Abstracts

High-throughput genomic technologies enable researchers to identify genes that are co-regulated under specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce a distinct gene set, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of the gene sets in the GO biological process category and 90% of the gene sets in the GO molecular function and cellular component categories were functionally cohesive (LPv < 0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can help in selecting appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called the Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature.

Availability: GCAT is freely available at http://binf1.memphis.edu/gcat
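The step from gene-to-gene similarities to a cohesion p-value can be illustrated with a small sketch. The abstract states only that LSI-derived similarities feed a Fisher's exact test; the 2x2 table layout used here (gene pairs inside versus outside the set, crossed with similarity above versus below a cutoff), the one-sided alternative, and the names `fisher_exact_greater`, `cohesion_pvalue`, and `threshold` are assumptions made for illustration and may differ from what GCAT actually does.

```python
from itertools import combinations
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where X is the
    top-left cell with all row and column totals held fixed.
    """
    n = a + b + c + d
    row1 = a + b  # pairs inside the gene set
    col1 = a + c  # pairs at or above the similarity cutoff
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, row1)


def cohesion_pvalue(sim, gene_set, threshold):
    """Hypothetical cohesion p-value for a gene set.

    sim       -- symmetric matrix (list of lists) of gene-to-gene
                 similarities, e.g. cosines from an LSI model
    gene_set  -- indices of the genes in the set under test
    threshold -- assumed cutoff above which a pair counts as related
    """
    members = set(gene_set)
    a = b = c = d = 0
    for i, j in combinations(range(len(sim)), 2):
        inside = i in members and j in members
        related = sim[i][j] >= threshold
        if inside and related:
            a += 1
        elif inside:
            b += 1
        elif related:
            c += 1
        else:
            d += 1
    return fisher_exact_greater(a, b, c, d)
```

Under these assumptions, a small p-value means the set contains more strongly related gene pairs than would be expected if related pairs were spread evenly across all genes in the model. A real analysis would also need a justified similarity cutoff and a background population matching the genes covered by the LSI model.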
