Discovering semantic features in the literature: a foundation for building functional associations

Background
Experimental techniques such as DNA microarrays, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

Results
We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, to support gene/protein classification, or even to be combined with experimental measurements. Researchers can use the semantic features to understand the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis that is capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide a coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.

Conclusion
The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
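
The idea of representing genes as additive, non-negative combinations of semantic features can be illustrated with a small sketch. The abstract does not specify the update rule, the term weighting, or the orientation of the matrix, so everything below is an assumption made for illustration: a toy gene-by-term count matrix with hypothetical gene and term names, factorized as V ≈ WH with the standard multiplicative updates for the Euclidean objective. It is not the authors' implementation.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Euclidean (Frobenius) distance between V and its reconstruction WH."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for vr, rr in zip(V, WH) for v, r in zip(vr, rr)) ** 0.5

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factorize non-negative V (genes x terms) into W (genes x k) and H (k x terms).

    Uses multiplicative updates, which keep every entry non-negative.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Hypothetical toy data: rows are genes, columns are term counts from their documents.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
terms = ["protease", "inhibitor", "serpin", "kinase", "phosphorylation", "signal"]
V = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 1, 2, 3],
]

k = 2  # number of semantic features (an assumed value for this toy example)
W, H = nmf(V, k)

# Each row of H is a semantic feature (a weighting over terms);
# each row of W is a gene's literature profile over those features.
for f, feature in enumerate(H):
    top = sorted(zip(terms, feature), key=lambda t: -t[1])[:3]
    print("feature", f, [name for name, _ in top])

# One simple way to cluster: assign each gene to its dominant feature.
dominant = [max(range(k), key=lambda f: row[f]) for row in W]
for gene, f in zip(genes, dominant):
    print(gene, "-> feature", f)
```

In this sketch the rows of `W` play the role of literature profiles, so pair-wise gene similarity could be computed directly between those rows, and the highest-weighted terms in each row of `H` give a readable justification for why genes were grouped together.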
