Navigating the Functional Landscape of Transcription Factors via Non-Negative Tensor Factorization Analysis of MEDLINE Abstracts

In this study, we developed and evaluated a novel text-mining approach, based on non-negative tensor factorization (NTF), to simultaneously extract and functionally annotate transcriptional modules consisting of sets of genes, transcription factors (TFs), and terms from MEDLINE abstracts. A sparse 3-mode term × gene × TF tensor was constructed that contained weighted frequencies of 106,895 terms in 26,781 abstracts shared among 7,695 genes and 994 TFs. The tensor was decomposed into sub-tensors using NTF across 16 different approximation ranks. The dominant entries of each of the 2,861 sub-tensors were extracted to form term–gene–TF annotated transcriptional modules (ATMs). More than 94% of the ATMs were enriched in at least one KEGG pathway or GO category, suggesting that the ATMs are functionally relevant. One advantage of this method is that it can discover potentially new gene–TF associations from the literature. Using a set of microarray and ChIP-Seq datasets as a gold standard, we show that the precision of our method for predicting gene–TF associations is significantly higher than chance. In addition, we demonstrate that the terms in each ATM can be used to suggest new GO classifications for genes and TFs. Taken together, our results indicate that NTF is useful for simultaneous extraction and functional annotation of transcriptional regulatory networks from unstructured text, as well as for literature-based discovery. A web tool called Transcriptional Regulatory Modules Extracted from Literature (TREMEL), available at http://binf1.memphis.edu/tremel, was built to enable browsing and searching of ATMs.
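
To make the decomposition step concrete, the following is a minimal sketch of a rank-R non-negative factorization of a small term × gene × TF tensor, followed by extraction of the dominant entries of each component. It is an illustration under stated assumptions, not the authors' implementation: the multiplicative-update rule, the 0.5 dominance threshold, the toy tensor, the placeholder labels, and the function names `khatri_rao`, `ntf_cp`, and `dominant_entries` are all hypothetical choices made for this example.

```python
import numpy as np


def khatri_rao(P, Q):
    """Column-wise Kronecker product; row index is p * Q.shape[0] + q."""
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])


def ntf_cp(X, rank, n_iter=500, eps=1e-9, seed=0):
    """Non-negative CP factorization of a 3-mode tensor.

    Assumption: Lee-Seung style multiplicative updates are used here for
    brevity; the abstract does not say which NTF algorithm was applied.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.random((n, rank)) + 0.1 for n in (I, J, K))
    # Mode-n unfoldings, laid out to match the row order of khatri_rao.
    X1 = X.reshape(I, J * K)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        A *= (X1 @ khatri_rao(B, C)) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= (X2 @ khatri_rao(A, C)) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= (X3 @ khatri_rao(A, B)) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C


def dominant_entries(factor, labels, frac=0.5):
    """Per component, keep labels whose loading is >= frac * column maximum.

    The 0.5 threshold is a hypothetical choice for this sketch.
    """
    selected = []
    for r in range(factor.shape[1]):
        col = factor[:, r]
        if col.max() <= 0:
            selected.append([])
            continue
        selected.append([labels[i] for i in np.flatnonzero(col >= frac * col.max())])
    return selected


# Toy tensor with two planted term-gene-TF blocks (placeholder labels only).
terms = [f"term{i}" for i in range(6)]
genes = [f"gene{j}" for j in range(5)]
tfs = [f"tf{k}" for k in range(4)]
X = np.zeros((6, 5, 4))
X[0:3, 0:2, 0:2] = 1.0
X[3:6, 2:5, 2:4] = 1.0

A, B, C = ntf_cp(X, rank=2)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# One annotated module per component: its dominant terms, genes, and TFs.
modules = [
    {"terms": t, "genes": g, "tfs": f}
    for t, g, f in zip(
        dominant_entries(A, terms),
        dominant_entries(B, genes),
        dominant_entries(C, tfs),
    )
]
for m in modules:
    print(m)
print("relative reconstruction error:", rel_err)
```

Each rank-one component plays the role of one sub-tensor, and thresholding its three factor columns yields a term–gene–TF module. Repeating the factorization for several values of `rank` and pooling the resulting modules would mirror the multi-rank procedure described above, although how the 16 ranks were chosen is not stated here.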
