Identification of Biological Relationships from Text Documents using Efficient Computational Methods

Biological literature databases continue to grow rapidly, and they hold information that is vital for sound biomedical research and development. Manually searching this literature and extracting the pertinent knowledge is tedious and time-consuming, even for motivated researchers. Biologists building biological models therefore need accurate and computationally efficient ways to discover relationships between biological objects in text documents. Here, "object" refers to any biological entity, such as a protein, gene, or cell cycle, and "relationship" refers either to a dynamic action that one object has on another (e.g., one protein inhibiting another) or to one object belonging to another (e.g., the cells composing an organ). This paper presents a novel approach to extracting relationships between multiple biological objects that appear in a text document. The approach involves object identification, reference resolution, ontology and synonym discovery, and extraction of object-object relationships. Hidden Markov Models (HMMs), dictionaries, and N-gram models provide the framework for the complex task of extracting object-object relationships. Experiments were carried out on a corpus of one thousand Medline abstracts, and intermediate results were obtained for object identification, synonym discovery, and finally relationship extraction. Across the thousand abstracts, 53 relationships were extracted, of which 43 were correct, giving a specificity of 81 percent (43 of 53 extracted relationships). These results are promising for multi-object identification and relationship finding in biological documents.
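As a rough illustration of the dictionary-driven part of such a pipeline, the following minimal sketch looks up objects in a small dictionary and reports a relationship when a known relation verb appears between two neighbouring objects. It is not the paper's implementation: the dictionary entries, the verb list, and the function names `find_objects`, `extract_relationships`, and `fraction_correct` are all hypothetical, and the HMM, N-gram, reference-resolution, and synonym-discovery steps are omitted entirely. The last function simply reproduces the reported 43-of-53 figure.

```python
import re

# Hypothetical mini-dictionaries for illustration only; the paper's
# actual dictionaries and relation vocabulary are not given in the abstract.
OBJECTS = {"p53", "mdm2", "cdk2", "cyclin e"}
RELATION_VERBS = {"inhibits", "activates", "binds", "phosphorylates"}


def find_objects(sentence):
    """Return sorted (start_offset, name) pairs for dictionary objects in the sentence."""
    lowered = sentence.lower()
    hits = []
    for name in OBJECTS:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
            hits.append((match.start(), name))
    return sorted(hits)


def extract_relationships(sentence):
    """Emit (object, verb, object) triples when a relation verb lies between adjacent objects."""
    lowered = sentence.lower()
    hits = find_objects(sentence)
    triples = []
    for (start_a, obj_a), (start_b, obj_b) in zip(hits, hits[1:]):
        # Text strictly between the end of the first object and the start of the second.
        between = lowered[start_a + len(obj_a):start_b]
        for verb in RELATION_VERBS:
            if re.search(r"\b" + re.escape(verb) + r"\b", between):
                triples.append((obj_a, verb, obj_b))
    return triples


def fraction_correct(correct, extracted):
    """Fraction of extracted relationships judged correct."""
    return correct / extracted


if __name__ == "__main__":
    print(extract_relationships("MDM2 inhibits p53 in vitro."))
    print(round(fraction_correct(43, 53), 2))
```

A pure dictionary lookup like this misses unseen names and spelling variants, which is presumably one reason the approach described above combines dictionaries with statistical models.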
