Paper information - Automated extraction of information on protein-protein interactions from the biological literature

Automated extraction of information on protein-protein interactions from the biological literature

MOTIVATION To understand biological processes, we must clarify how proteins interact with each other. However, since information about protein-protein interactions still exists primarily in the scientific literature, it is not accessible in a computer-readable format. Efficient processing of large numbers of interactions therefore needs an intelligent information extraction method. Our aim is to develop an efficient method for extracting information on protein-protein interactions from the scientific literature.

RESULTS We present a method for extracting information on protein-protein interactions from the scientific literature. This method, which employs only a protein name dictionary, surface clues on word patterns and simple part-of-speech rules, achieved high recall and precision rates for yeast (recall = 86.8% and precision = 94.3%) and Escherichia coli (recall = 82.5% and precision = 93.5%). The extraction results suggest that our method should be applicable to any species for which a protein name dictionary has been constructed.

AVAILABILITY The program is available on request from the authors.
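As a rough illustration of the kind of approach the abstract describes (a protein name dictionary combined with surface word patterns), the following minimal Python sketch matches "protein, interaction keyword, protein" sequences in a sentence and computes precision and recall against a hand-labelled set. It is not the authors' program: the dictionary entries, the keyword list, and the function names `extract_interactions` and `precision_recall` are all assumptions made for illustration, and the part-of-speech rules mentioned in the abstract are not modelled.

```python
import re

# Hypothetical toy dictionary; the paper's actual dictionaries are not reproduced here.
PROTEIN_NAMES = {"Cdc28", "Cln2", "Sic1", "RecA", "LexA"}

# Hypothetical interaction keywords standing in for "surface clues on word patterns".
INTERACTION_WORDS = ("interacts with", "associates with", "binds to", "binds", "phosphorylates")


def extract_interactions(sentence):
    """Return (protein_a, keyword, protein_b) triples found in one sentence."""
    # Longer alternatives first so that "binds to" is preferred over "binds".
    names = "|".join(sorted(map(re.escape, PROTEIN_NAMES), key=len, reverse=True))
    words = "|".join(sorted(map(re.escape, INTERACTION_WORDS), key=len, reverse=True))
    pattern = re.compile(rf"\b({names})\b\s+({words})\s+\b({names})\b")
    # finditer yields non-overlapping matches, so chained mentions are only partly covered.
    return [m.groups() for m in pattern.finditer(sentence)]


def precision_recall(predicted, gold):
    """Compute precision and recall of predicted items against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(extract_interactions("Cdc28 binds to Sic1 in vitro."))
```

A real system would also need to handle name variants, passive constructions and negation, which this sketch ignores.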

[1] Peter D. Karp, et al. EcoCyc: Encyclopedia of Escherichia coli genes and metabolism, 1998, Nucleic Acids Res.

[2] David Botstein, et al. SGD: Saccharomyces Genome Database, 1998, Nucleic Acids Res.

[3] Peter D. Karp, et al. EcoCyc: encyclopedia of Escherichia coli genes and metabolism, 1999, Nucleic Acids Res.

[4] T. Takagi, et al. Toward information extraction: identifying protein names from biological papers, 1998, Pacific Symposium on Biocomputing.

[5] Miguel A. Andrade-Navarro, et al. Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions, 1999, ISMB.

[6] Park, et al. Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts, 1998, Genome Informatics. Workshop on Genome Informatics.

[7] C. Ouzounis, et al. Automatic extraction of protein interactions from scientific abstracts, 1999, Pacific Symposium on Biocomputing.

[8] J. M. Cherry. Genetic nomenclature guide. Saccharomyces cerevisiae, 1995, Trends in Genetics: TIG.

[9] K. Chater, et al. Genetic nomenclature guide. Bacteria, 1995, Trends in Genetics: TIG.

[10] Proux, et al. Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction, 1998, Genome Informatics. Workshop on Genome Informatics.

[11] Susumu Goto, et al. KEGG: Kyoto Encyclopedia of Genes and Genomes, 2000, Nucleic Acids Res.

[12] Shalom Lappin, et al. An Algorithm for Pronominal Anaphora Resolution, 1994, CL.

[13] Jérôme Euzenat, et al. Grasping at molecular interactions and genetic networks in Drosophila melanogaster using FlyNets, an Internet database, 1999, Nucleic Acids Res.

[14] Dmitrij Frishman, et al. MIPS: a database for genomes and protein sequences, 2000, Nucleic Acids Res.

[15] Eric Brill, et al. Some Advances in Transformation-Based Part of Speech Tagging, 1994, AAAI.

[16] N. W. Davis, et al. The complete genome sequence of Escherichia coli K-12, 1997, Science.

[17] 中尾光輝, et al. KEGG (Kyoto Encyclopedia of Genes and Genomes) [in Japanese] (Special feature: The present and future of genomic medicine, basic and clinical) (Databases), 2000.

[18] Martin F. Porter, et al. An algorithm for suffix stripping, 1997, Program.