RLIMS-P 2.0: A Generalizable Rule-Based Information Extraction System for Literature Mining of Protein Phosphorylation Information

We introduce RLIMS-P version 2.0, an enhanced rule-based information extraction (IE) system for mining kinase, substrate, and phosphorylation site information from scientific literature. Consisting of natural language processing and IE modules, the system integrates several new features, including the ability to process full-text articles and generalizability to other post-translational modifications (PTMs). To evaluate the system, we annotated sets of abstracts and full-text articles containing a variety of textual expressions. On the abstract corpus, the system achieved F-scores of 0.91, 0.92, and 0.95 for kinases, substrates, and sites, respectively. The corresponding scores on the full-text corpus were 0.88, 0.91, and 0.92. The system was additionally evaluated on the corpus of the 2013 BioNLP-ST GE task, where it achieved an F-score of 0.87 for the phosphorylation core task, improving upon the results previously reported on that corpus. Full-scale processing of all abstracts in MEDLINE and all articles in the PubMed Central Open Access Subset has demonstrated that the system scales to mining the rich information in the literature, enabling its adoption for biocuration and knowledge discovery. The new system is generalizable and will be adapted to tackle other major PTM types. RLIMS-P 2.0 is available online (http://proteininformationresource.org/rlimsp/), and the developed corpora are available from iProLINK (http://proteininformationresource.org/iprolink/).
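For readers unfamiliar with the metric, the sketch below shows how an F-score is commonly computed from extraction counts. It assumes the balanced F-score (F1, the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_score` and the counts in the example are hypothetical, not taken from the evaluation.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score (F1) from true-positive, false-positive, and false-negative counts.

    Assumption: F1 is the variant intended; the abstract reports only "F-scores".
    """
    # Precision: fraction of extracted items that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of annotated items that were extracted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not the paper's data):
# 80 correct extractions, 20 spurious, 20 missed.
print(round(f_score(tp=80, fp=20, fn=20), 2))  # prints 0.8
```

Under this definition, a system must keep both spurious and missed extractions low to score well, since the harmonic mean is pulled toward the smaller of precision and recall.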
