Automated extraction of information on protein-protein interactions from the biological literature

MOTIVATION: To understand biological processes, we must clarify how proteins interact with each other. However, because information about protein-protein interactions still exists primarily in the scientific literature, it is not accessible in a computer-readable format. Efficient processing of large numbers of interactions therefore requires an intelligent information extraction method. Our aim is to develop an efficient method for extracting information on protein-protein interactions from the scientific literature.

RESULTS: We present such a method. It employs only a protein name dictionary, surface clues on word patterns and simple part-of-speech rules, and it achieved high recall and precision for yeast (recall = 86.8%, precision = 94.3%) and Escherichia coli (recall = 82.5%, precision = 93.5%). The extraction results suggest that the method should be applicable to any species for which a protein name dictionary has been constructed.

AVAILABILITY: The program is available on request from the authors.
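To make the ingredients named in the abstract concrete, the following is a minimal sketch of dictionary-plus-word-pattern extraction and of how recall and precision are computed. It is not the authors' program: the dictionary entries, the interaction keywords, and the function names `extract_interactions` and `precision_recall` are all illustrative assumptions, and the part-of-speech rules are omitted entirely.

```python
import re

# Hypothetical toy protein name dictionary (assumption, not from the paper).
PROTEIN_NAMES = {"Cdc28", "Cln2", "Sic1", "RecA", "LexA"}

# Hypothetical surface clues: words that signal an interaction (assumption).
INTERACTION_WORDS = ("interacts with", "binds to", "binds", "phosphorylates")


def _alternation(terms):
    """Build a regex alternation, longest term first so 'binds to' beats 'binds'."""
    return "|".join(sorted(map(re.escape, terms), key=len, reverse=True))


# Pattern: <protein name> <interaction word> <protein name>
PATTERN = re.compile(
    rf"\b({_alternation(PROTEIN_NAMES)})\b\s+"
    rf"({_alternation(INTERACTION_WORDS)})\s+"
    rf"\b({_alternation(PROTEIN_NAMES)})\b"
)


def extract_interactions(sentence):
    """Return (protein_a, interaction_word, protein_b) triples found in a sentence."""
    return [match.groups() for match in PATTERN.finditer(sentence)]


def precision_recall(predicted, gold):
    """Compute (precision, recall) of predicted triples against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Precision is the fraction of extracted interactions that are correct, and recall is the fraction of true interactions that were extracted. A pure surface pattern like this one misses passive constructions, negation and coordinated names, which is presumably where rules beyond a bare pattern become necessary.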
