Using syntax features and document discourse for relation extraction on PharmGKB and CTD

We present an approach to extracting relations between pharmacogenomics entities such as drugs, genes, and diseases that is based on syntax and on discourse. Discourse in particular has not been widely studied as a means of improving text mining. We learn syntactic features semi-automatically from lean document-level annotation. We show how a simple Maximum Entropy based machine learning approach helps to estimate the relevance of candidate relations from dependency-based features found on the syntactic path connecting the involved entities. Maximum Entropy based relevance estimation of candidate pairs, conditioned on syntactic features, improves relation ranking by 68% (relative) as measured by AUCiP/R and by 60% as measured by TAP-k (k=10). We also show that automatically recognizing document-level discourse characteristics to expand and filter acronyms improves term recognition and interaction detection by 12% (relative), as measured by AUCiP/R and by TAP-k (k=10). Our pilot study uses PharmGKB and CTD as resources.
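
As a rough illustration of the relevance estimation described above, the following sketch trains a binary maximum entropy model (equivalent to logistic regression) on features taken from the dependency path between two entities, then ranks candidate pairs by predicted relevance. The feature names, toy training data, hyperparameters, and the functions `train_maxent` and `predict` are assumptions made for illustration only; they are not taken from the study and do not reproduce its implementation or results.

```python
import math
from collections import defaultdict


def predict(weights, feats):
    """Probability that a candidate pair is relevant, given its path features."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(examples, epochs=200, lr=0.5, l2=0.01):
    """Fit binary maxent weights by stochastic gradient ascent.

    examples: list of (feature_list, label) pairs with label in {0, 1}.
    The L2 penalty is applied only to features active in the current example.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            err = label - predict(weights, feats)
            for f in feats:
                weights[f] += lr * (err - l2 * weights[f])
    return weights


# Hypothetical dependency-path features for co-occurring entity pairs.
# Label 1 means the pair is annotated as related at document level.
train = [
    (["head:inhibit", "dep:subj", "dep:obj"], 1),
    (["head:metabolize", "dep:subj", "dep:pobj"], 1),
    (["head:measure", "dep:conj"], 0),
    (["head:include", "dep:conj", "dep:appos"], 0),
]
weights = train_maxent(train)

# Toy candidate pairs with the features found on their connecting path.
candidates = {
    ("warfarin", "CYP2C9"): ["head:metabolize", "dep:subj", "dep:pobj"],
    ("aspirin", "TP53"): ["head:include", "dep:conj"],
}

# Rank candidates from most to least relevant.
ranked = sorted(
    candidates,
    key=lambda pair: predict(weights, candidates[pair]),
    reverse=True,
)
```

Ranking by predicted probability, instead of applying a hard threshold, matches the ranking-oriented evaluation measures named in the abstract (AUCiP/R and TAP-k), which score the ordering of candidate relations.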
