Distributed smoothed tree kernel for protein-protein interaction extraction from the biomedical literature

Automatic extraction of protein-protein interaction (PPI) pairs from the biomedical literature is a widely studied task in biological information extraction. Many kernel-based approaches, such as linear kernels, tree kernels, graph kernels, and combinations of multiple kernels, have achieved promising results on the PPI task. However, most of these kernel methods fail to capture the semantic relation between two entities. In this paper, we present a special type of tree kernel for PPI extraction, the Distributed Smoothed Tree Kernel (DSTK), which exploits both syntactic (structural) information and semantic vector information. DSTK comprises distributed trees, which carry syntactic information, together with distributional semantic vectors, which represent the semantic information of sentences or phrases. To build a robust machine learning model, a feature-based kernel and DSTK were combined using an ensemble support vector machine (SVM). Five corpora (AIMed, BioInfer, HPRD50, IEPA, and LLL) were used to evaluate the performance of our system. Experimental results show that our system achieves a higher F-score on these five corpora than other state-of-the-art systems.
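
The idea of combining a structural kernel with a feature-based kernel can be illustrated with a minimal sketch. This is not the DSTK or the ensemble SVM described above: the tree kernel below simply counts shared grammar productions, no distributional vectors are involved, and all names (`productions`, `tree_kernel`, `combined_kernel`, the `weight` parameter, the toy trees and feature vectors) are assumptions made for illustration only.

```python
from collections import Counter


def productions(tree):
    """Count grammar productions in a tree given as nested tuples (label, child, ...).

    Leaves are plain strings and contribute no production of their own.
    """
    if isinstance(tree, str):
        return Counter()
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts = Counter({(label, rhs): 1})
    for child in children:
        counts += productions(child)
    return counts


def tree_kernel(t1, t2):
    """Toy structural kernel: dot product of production counts (not DSTK)."""
    p1, p2 = productions(t1), productions(t2)
    return sum(p1[key] * p2[key] for key in p1)


def linear_kernel(x, y):
    """Feature-based kernel: plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def combined_kernel(a, b, weight=0.5):
    """Weighted sum of the two kernels; each instance is a (tree, features) pair.

    A non-negative weighted sum of valid kernels is itself a valid kernel.
    The weight is a hypothetical hyperparameter, not a value from the paper.
    """
    (tree_a, feats_a), (tree_b, feats_b) = a, b
    return (weight * tree_kernel(tree_a, tree_b)
            + (1 - weight) * linear_kernel(feats_a, feats_b))


# Hypothetical parse trees for two candidate sentences mentioning a protein pair.
t1 = ("S", ("NP", "ProtA"), ("VP", ("V", "binds"), ("NP", "ProtB")))
t2 = ("S", ("NP", "ProtA"), ("VP", ("V", "inhibits"), ("NP", "ProtB")))

# Hypothetical hand-crafted feature vectors for the same two sentences.
x1 = [1, 0, 2]
x2 = [1, 1, 1]

similarity = combined_kernel((t1, x1), (t2, x2))
```

Values from such a combined kernel could be collected into a Gram matrix and passed to an SVM that accepts precomputed kernels; how the paper actually weights or ensembles the two kernels is not stated in the abstract.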
