Nonnegative Tensor Factorization of Biomedical Literature for Analysis of Genomic Data

Rapid growth of the biomedical literature on genes and molecular pathways presents a serious challenge for the interpretation of genomic data. Previous work has focused on using singular value decomposition (SVD) and nonnegative matrix factorization (NMF) to extract gene relationships from Medline abstracts. However, these methods are limited to two-dimensional data. Here, we explore the utility of nonnegative tensor factorization for extracting semantic relationships between genes and the transcription factors (TFs) that regulate them, using a previously published microarray dataset. A tensor was generated for a group of 86 interferon-stimulated genes, 409 TFs, and 2325 terms extracted from shared Medline abstracts. Clusters of terms, genes, and TFs were evaluated at various values of k. For this dataset, certain genes (Il6 and Jak2) and TFs (Stat3, Stat2, and Irf3) ranked at the top across most values of k, along with terms such as activation, interferon, cell, and signaling. Further examination of several clusters, using gene pathway databases as well as natural language processing tools, revealed that nonnegative tensor factorization accurately identified genes and TFs in well-established signaling pathways. For example, the method identified genes and TFs in the interferon/Toll receptor pathway with high average precision (0.695–0.938) across multiple values of k. In addition, the method revealed gene-TF clusters that were not well documented, perhaps pointing to new discoveries. Taken together, this work provides proof of concept that nonnegative tensor factorization could be useful in the interpretation of genomic data.
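
As a rough illustration of the kind of factorization described above, the following Python sketch fits a nonnegative PARAFAC-style model to a small three-way tensor using multiplicative updates. It is a minimal sketch under stated assumptions, not the procedure used in this work: the update rule, the function names (`khatri_rao`, `unfold`, `ntf`), the toy tensor dimensions, and the reading of k as the number of components are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x k) and (K x k) -> (J*K x k)."""
    J, k = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, k)

def unfold(X, mode):
    """Matricize X along `mode`; remaining axes are flattened in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def ntf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Hypothetical helper: nonnegative rank-k factorization of a 3-way tensor.

    Uses Lee-Seung style multiplicative updates on each factor in turn,
    which keep every entry nonnegative when X is nonnegative.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, k)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            o1, o2 = [factors[m] for m in range(3) if m != mode]
            numer = unfold(X, mode) @ khatri_rao(o1, o2)
            # (o1 khatri-rao o2)^T (o1 khatri-rao o2) = (o1^T o1) * (o2^T o2)
            denom = factors[mode] @ ((o1.T @ o1) * (o2.T @ o2))
            factors[mode] *= numer / (denom + eps)
    return factors

# Synthetic stand-in for a gene x TF x term tensor (sizes are illustrative only).
rng = np.random.default_rng(42)
true = [rng.random((n, 3)) for n in (6, 5, 7)]
X = np.einsum("ir,jr,kr->ijk", *true)

factors = ntf(X, k=3)
A, B, C = factors  # gene, TF, and term loadings per component

# Relative reconstruction error of the fitted model.
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Indices of the highest-loading "genes" in the first component.
top_genes = np.argsort(-A[:, 0])[:3]
print("relative error:", rel_err)
print("top gene indices for component 0:", top_genes)
```

Each column of `A`, `B`, and `C` describes one component, so sorting a column by loading gives a ranked list of genes, TFs, or terms for that component. Ranked lists of this kind are what a pathway-based measure such as average precision would be computed on.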
