Improving Distantly Supervised Extraction of Drug-Drug and Protein-Protein Interactions

Relation extraction is frequently and successfully addressed by machine learning methods. The downside of this approach is the need for annotated training data, which is typically produced through tedious, cost-intensive manual work. Distantly supervised approaches instead make use of weakly annotated data, such as automatically annotated corpora. Recent work in the biomedical domain has applied distant supervision to protein-protein interaction (PPI) extraction using the IntAct database, with reasonable results. Such data is typically noisy, and heuristics to filter it are commonly applied. We propose a constraint to increase the quality of the training data, based on the assumption that sentences do not describe self-interactions of real-world objects. In addition, we make use of the University of Kansas Proteomics Service (KUPS) database. Together, these two steps yield an increase of 7 percentage points (pp) on the PPI corpus AIMed. We demonstrate the broad applicability of our approach by using the same workflow for the analysis of drug-drug interactions, utilizing relationships available from the drug database DrugBank. Without manually annotated training data, we achieve an F1 measure of 37.31% on an independent test set.
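
The abstract does not spell out how the self-interaction constraint is implemented, so the following is only a minimal sketch of one possible reading: candidate pairs within a sentence are labeled by lookup in a set of known interactions, and pairs whose two mentions resolve to the same real-world object are discarded. The names `Mention`, `entity_id`, and `label_pairs`, as well as the toy identifiers, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    text: str       # surface form found in the sentence
    entity_id: str  # normalized identifier (hypothetical, e.g. a database accession)


def label_pairs(mentions, known_interactions):
    """Build weakly labeled training pairs for a single sentence.

    A pair is labeled positive if its two identifiers appear together in
    `known_interactions` (a set of frozensets), and negative otherwise.
    Pairs whose mentions share an identifier are skipped: under the
    assumed constraint, a sentence does not describe a self-interaction.
    """
    examples = []
    for a, b in combinations(mentions, 2):
        if a.entity_id == b.entity_id:
            continue  # same real-world object: drop the pair
        pair = frozenset((a.entity_id, b.entity_id))
        examples.append((a, b, pair in known_interactions))
    return examples


# Toy knowledge base and sentence; all identifiers are made up.
known = {frozenset(("P1", "P2"))}
sentence = [
    Mention("ProtA", "P1"),
    Mention("protein A", "P1"),  # synonym of the first mention
    Mention("ProtB", "P2"),
    Mention("ProtC", "P3"),
]
examples = label_pairs(sentence, known)
```

In this toy sentence, the pair formed by the two mentions of `P1` is dropped, while the remaining pairs are kept and labeled by database lookup. Using unordered pairs (`frozenset`) is a simplifying assumption that treats interactions as symmetric.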
