Latent Semantic Indexing of PubMed abstracts for identification of transcription factor candidates from microarray derived gene sets

Background: Identification of transcription factors (TFs) responsible for modulation of differentially expressed genes is a key step in deducing gene regulatory pathways. Most current methods identify TFs by searching for the presence of DNA binding motifs in the promoter regions of co-regulated genes. However, this strategy is not always useful, because the presence of a motif does not necessarily imply a regulatory role. Conversely, motif presence may not be required for a TF to regulate a set of genes. Therefore, it is imperative to include functional (biochemical and molecular) associations, such as those found in the biomedical literature, in algorithms for identifying putative regulatory TFs that might be explicitly or implicitly linked to the genes under investigation.

Results: In this study, we present a Latent Semantic Indexing (LSI) based text mining approach for identifying and ranking putative regulatory TFs from microarray derived differentially expressed genes (DEGs). Two LSI models were built using different term weighting schemes to derive pairwise similarities between 21,027 mouse genes annotated in the Entrez Gene repository. Among these genes, 433 were designated TFs in the TRANSFAC database. The LSI derived TF-to-gene similarities were used to calculate TF literature enrichment p-values and to rank the TFs for a given set of genes. We evaluated our approach using five publicly available microarray datasets focusing on the TFs Rel, Stat6, Ddit3, Stat5 and Nfic. In addition, for each dataset, we constructed a gold standard set of TFs known to be functionally relevant to the study in question. Receiver Operating Characteristic (ROC) curves showed that, in identifying putative TFs, the log-entropy LSI model outperformed the tf-normal LSI model and a benchmark co-occurrence based method for four out of five datasets, and also outperformed motif searching approaches.

Conclusions: Our results suggest that our LSI based text mining approach can complement existing approaches used in systems biology research to decipher gene regulatory networks by providing ranked lists of putative TFs that might be explicitly or implicitly associated with sets of DEGs derived from microarray experiments. In addition, unlike motif searching approaches, LSI based approaches can reveal TFs that may indirectly regulate genes.
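
The ranking idea described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the toy term counts, the natural-log form of log-entropy weighting, the choice of k = 2 factors, and the helper names are all assumptions made for illustration. The abstract does not say how the enrichment p-values are computed, so the sketch ranks TFs by their mean cosine similarity to the gene set as a simple stand-in.

```python
import numpy as np


def log_entropy_weight(counts):
    """Apply log-entropy weighting to a terms x documents count matrix.

    Local weight: log(1 + f_ij). Global weight: 1 + sum_j p_ij*log(p_ij)/log(n),
    where p_ij = f_ij / sum_j f_ij and n is the number of documents (n > 1).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    row_sums = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    safe_p = np.where(p > 0, p, 1.0)  # log(1) = 0, so empty cells contribute 0
    global_w = 1.0 + (p * np.log(safe_p)).sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None]


def lsi_document_vectors(weighted, k):
    """Return documents x k coordinates from a rank-k truncated SVD."""
    _, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return (vt[:k, :] * s[:k, None]).T


def cosine_matrix(vectors):
    """Pairwise cosine similarities between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = np.divide(vectors, norms, out=np.zeros_like(vectors), where=norms > 0)
    return unit @ unit.T


def rank_tfs(sim, tf_idx, gene_set_idx):
    """Rank TFs by mean similarity to the gene set (stand-in for p-values)."""
    scores = {}
    for tf in tf_idx:
        others = [g for g in gene_set_idx if g != tf]
        scores[tf] = float(sim[tf, others].mean())
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: rows are terms, columns are gene "documents"
# (one pooled abstract collection per gene). Genes 0 and 1 play the TFs.
counts = np.array([
    [2, 0, 1, 1, 0],
    [1, 0, 2, 0, 0],
    [0, 0, 1, 2, 0],
    [0, 2, 0, 0, 1],
    [0, 1, 0, 0, 2],
    [0, 1, 0, 0, 1],
])

doc_vectors = lsi_document_vectors(log_entropy_weight(counts), k=2)
sim = cosine_matrix(doc_vectors)
ranking = rank_tfs(sim, tf_idx=[0, 1], gene_set_idx=[2, 3])
print(ranking)
```

In this toy matrix, gene 0 shares vocabulary with genes 2 and 3 while gene 1 does not, so gene 0 is ranked first for that gene set. A real model would be built from the pooled abstracts of thousands of genes and would use far more than two factors.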
