Text Mining of Protein Phosphorylation Information Using a Generalizable Rule-Based Approach

Literature-based annotation of protein phosphorylation is a focus of many biological databases because phosphorylation is a global regulator of cellular activity. Text mining has been used to speed up manual curation of phosphorylation information. In this paper, we report our ongoing effort to enhance RLIMS-P, a rule-based information extraction (IE) system that identifies protein phosphorylation information in the scientific literature. Although RLIMS-P attained high accuracy, its reliance on elaborate patterns and rules demanded substantial effort for system development and maintenance. To mitigate this challenge, we redesigned RLIMS-P and integrated new natural language processing (NLP) techniques. The system has also been adapted to mine full-text articles and generalized to exploit features common to different post-translational modifications (PTMs). The updated RLIMS-P (version 2.0) was evaluated on abstracts in the publicly available BioNLP GENIA event extraction (GE) corpus, where it achieved F-scores of 0.92 and 0.96 for phosphorylation substrate and site, respectively. On a full-text corpus developed in-house, it achieved F-scores of 0.91 and 0.92 for substrate and site, respectively, and 0.88 for kinase. The system was also applied to the PubMed Central (PMC) Open Access Subset, with promising results in mining full-text articles. RLIMS-P focuses on protein phosphorylation information, but its new design should be generalizable to other PTM types. RLIMS-P version 2.0 is available at: http://proteininformationresource.org/rlimsp/.
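
To make the idea of rule-based extraction of kinase, substrate, and site concrete, the following is a minimal illustrative sketch in Python. It is not the RLIMS-P implementation: the two regular-expression patterns, the `PhosphoEvent` type, and the `extract` and `f_score` functions are hypothetical names chosen for this example, and real systems rely on far richer linguistic analysis than surface patterns. The `f_score` helper uses the standard definition of the F-score as the harmonic mean of precision and recall.

```python
import re
from typing import NamedTuple, Optional


class PhosphoEvent(NamedTuple):
    """Hypothetical container for one extracted phosphorylation mention."""
    kinase: Optional[str]
    substrate: str
    site: Optional[str]


# Assumption: proteins are single tokens and sites look like "Ser-9" or "Tyr15".
NAME = r"[A-Za-z0-9-]+"
SITE = r"(?:Ser|Thr|Tyr)-?\d+"

# Toy rule 1: "<kinase> phosphorylates <substrate> [at|on <site>]"
ACTIVE = re.compile(
    rf"(?P<kinase>{NAME}) phosphorylates (?P<substrate>{NAME})"
    rf"(?: (?:at|on) (?P<site>{SITE}))?"
)

# Toy rule 2: "<substrate> is|was phosphorylated [at|on <site>] [by <kinase>]"
PASSIVE = re.compile(
    rf"(?P<substrate>{NAME}) (?:is|was) phosphorylated"
    rf"(?: (?:at|on) (?P<site>{SITE}))?"
    rf"(?: by (?P<kinase>{NAME}))?"
)


def extract(sentence: str) -> Optional[PhosphoEvent]:
    """Return the first event matched by either toy rule, or None."""
    for pattern in (ACTIVE, PASSIVE):
        match = pattern.search(sentence)
        if match:
            return PhosphoEvent(
                kinase=match.group("kinase"),
                substrate=match.group("substrate"),
                site=match.group("site"),
            )
    return None


def f_score(tp: int, fp: int, fn: int) -> float:
    """F-score from true positive, false positive, and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, `extract("AKT phosphorylates GSK3 at Ser-9.")` fills all three fields, while a sentence with no phosphorylation trigger yields `None`. Separating the patterns from the matching loop reflects one reason a rule set can be extended to other PTM types: in this sketch only the trigger words would need to change.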
