Mining protein phosphorylation information from biomedical literature using NLP parsing and Support Vector Machines

BACKGROUND Extraction of protein phosphorylation information from biomedical literature has gained much attention because of the importance of phosphorylation in numerous biological processes. OBJECTIVE In this study, we propose a text mining methodology consisting of two phases, NLP parsing and SVM classification, to extract phosphorylation information from the literature. METHODS First, using NLP parsing, we divide the data into three base-forms depending on the biomedical entities related to phosphorylation, and further classify them into ten sub-forms based on their distribution relative to the phosphorylation keyword. Next, we extract the phosphorylation entity singles/pairs/triplets and apply an SVM to classify them using a set of features applicable to each sub-form. RESULTS The performance of our methodology was evaluated on three corpora, namely the PLC, iProLink and hPP corpora. We obtained promising results of >85% F-score on the training datasets of the ten sub-forms in cross-validation. Our system achieved an overall F-score of 93.0% on the iProLink and 96.3% on the hPP corpus test datasets. Furthermore, our proposed system achieved the best performance in cross-corpus evaluation and outperformed the existing system with a recall of 90.1%. CONCLUSIONS The performance analysis of our system on three corpora reveals that it extracts protein phosphorylation information efficiently from both non-organism-specific general datasets such as PLC and iProLink and a human-specific dataset such as the hPP corpus.
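The results above are reported as recall and F-score. As a reminder of how these metrics relate, here is a minimal sketch that assumes the standard definitions (the F-score as the harmonic mean of precision and recall). The function name and the counts in the usage line are hypothetical illustrations, not values or code from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from raw extraction counts.

    tp: extracted items that are correct (true positives)
    fp: extracted items that are wrong (false positives)
    fn: correct items the system missed (false negatives)
    """
    # Guard against division by zero when nothing was extracted or expected.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the call.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a high F-score requires precision and recall to be high together.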
