Analysis of Protein/Protein Interactions Through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Text Articles

The challenge of knowledge management in the pharmaceutical industry is twofold. First, it must integrate sequence data with the vast and growing body of data from functional analysis of genes and with the information held in large archival databases. Second, as the number of biomedical publications grows exponentially (Medline now contains more than 13 million records), researchers need assistance to broaden their view and understanding of scientific domains. Just as data mining uncovers relationships in information, text mining uncovers relationships in a text collection and leverages the creativity of the knowledge worker in exploring these relationships and discovering new knowledge. We describe a text mining method that automatically detects protein interactions described across a large number of scientific publications. The method relies on natural language processing to identify protein names, their synonyms, and the various interactions they can have with other proteins. We then compared text mining of abstracts with the same analysis of full text articles to assess how much information is lost when only abstracts are processed. Our results show that: (1) LexiQuest Mine is a versatile and accurate tool for mining biomedical literature to analyze interactions between proteins; (2) mining only abstracts can be sufficient and time-saving for large-scale applications that do not require a high level of detail, whereas mining full text articles is preferable for more exhaustive applications designed to address a specific issue. Availability: LexiQuest Mine is available for commercial licensing from SPSS, Inc.
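
To make the kind of analysis described above concrete, the following is a minimal Python sketch, not the LexiQuest Mine implementation. It assumes a tiny hand-written synonym dictionary (`SYNONYMS`), a short list of interaction verb stems (`INTERACTION_STEMS`), and sentence-level co-occurrence as the extraction rule; all of these names and the toy texts are hypothetical. The last function illustrates one possible way to estimate how much interaction information survives when only the abstract is mined.

```python
import re
from itertools import combinations

# Hypothetical synonym dictionary: lowercase surface form -> canonical name.
SYNONYMS = {
    "p53": "TP53",
    "tp53": "TP53",
    "mdm2": "MDM2",
    "hdm2": "MDM2",
    "p21": "CDKN1A",
    "cdkn1a": "CDKN1A",
}

# Hypothetical interaction verb stems (matched as token prefixes).
INTERACTION_STEMS = ("bind", "interact", "phosphorylat", "inhibit", "activat")


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_interactions(text):
    """Return a set of (protein_a, verb_stem, protein_b) triples.

    A triple is emitted when two distinct canonical proteins and an
    interaction verb stem occur in the same sentence (co-occurrence only;
    no parsing, so the direction of the interaction is not recovered).
    """
    found = set()
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
        proteins = {SYNONYMS[t] for t in tokens if t in SYNONYMS}
        stems = {s for t in tokens for s in INTERACTION_STEMS if t.startswith(s)}
        for a, b in combinations(sorted(proteins), 2):
            for stem in stems:
                found.add((a, stem, b))
    return found


def abstract_coverage(abstract, full_text):
    """Fraction of full-text interactions that are also found in the abstract."""
    full = extract_interactions(full_text)
    if not full:
        return 0.0
    return len(extract_interactions(abstract) & full) / len(full)


if __name__ == "__main__":
    abstract = "MDM2 binds p53."
    full_text = (
        "MDM2 binds p53. In addition, p53 activates p21 transcription. "
        "HDM2 inhibits TP53 activity."
    )
    print(sorted(extract_interactions(full_text)))
    print(abstract_coverage(abstract, full_text))
```

In this toy setting the synonym dictionary maps "HDM2" and "p53" to the same canonical entries as "MDM2" and "TP53", so interactions reported under different names are merged. A production system would need real name recognition and syntactic analysis rather than prefix matching on tokens.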
