Biomedical Relationship Extraction from Literature Based on Bio-semantic Token Subsequences

Relationship Extraction (RE) from biomedical literature is an important and challenging problem in both text mining and bioinformatics. Although various approaches have been proposed to extract protein-protein interaction types, their accuracy rates leave a large room for further exploration of more effective methods. In this paper, two supervised learning algorithms based on newly-defined “bio-semantic token subsequence” are proposed for multi-class biomedical relationship extraction. The first approach calculates a “bio-semantic token subsequence kernel”, while the second one explicitly extracts weighted features from bio-semantic token subsequences. The proposed structure called “bio-semantic token subsequence” is able to capture semantic features from natural language sentences for biomedical RE. Two supervised learning algorithms based on the proposed structure outperform the state-of-the-art biomedical RE methods on multi-class protein-protein interaction extraction.

[1]  Jean-Michel Renders,et al.  Word-Sequence Kernels , 2003, J. Mach. Learn. Res..

[2]  Jude W. Shavlik,et al.  Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction , 2004, ILP.

[3]  C. Ouzounis,et al.  Automatic extraction of protein interactions from scientific abstracts. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[4]  Dragomir R. Radev,et al.  Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency Parsing , 2007, EMNLP.

[5]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[6]  Dmitry Zelenko,et al.  Kernel Methods for Relation Extraction , 2002, J. Mach. Learn. Res..

[7]  Burr Settles ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text , 2005 .

[8]  Russ B Altman,et al.  Extracting and characterizing gene-drug relationships from the literature. , 2004, Pharmacogenetics.

[9]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[10]  Michael Krauthammer,et al.  GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles , 2001, ISMB.

[11]  Razvan C. Bunescu,et al.  Subsequence Kernels for Relation Extraction , 2005, NIPS.

[12]  Anton Yuryev,et al.  Extracting human protein interactions from MEDLINE using a full-sentence parser , 2004, Bioinform..

[13]  Barbara Rosario,et al.  Multi-way Relation Classification: Application to Protein-Protein Interactions , 2005, HLT.

[14]  Mark Craven,et al.  Representing Sentence Structure in Hidden Markov Models for Information Extraction , 2001, IJCAI.

[15]  Alfonso Valencia,et al.  The Frame-Based Module of the SUISEKI Information Extraction System , 2002, IEEE Intell. Syst..

[16]  Jiawei Han,et al.  Discriminative Frequent Pattern Analysis for Effective Classification , 2007, 2007 IEEE 23rd International Conference on Data Engineering.