Literature mining of protein phosphorylation using dependency parse trees.

As one of the most common post-translational modifications (PTMs), protein phosphorylation plays an important role in various biological processes, such as signal transduction, cellular metabolism, differentiation, growth, regulation and apoptosis. Knowledge of protein phosphorylation is of great value not only for elucidating the underlying molecular mechanisms but also for the treatment of diseases and the design of new drugs. Recently, there has been increasing interest in automatically extracting phosphorylation information from the biomedical literature. However, this remains a challenging task because of the tremendous volume of literature and the indirect, varied ways in which protein phosphorylation is expressed in text. To address this issue, we propose a novel text-mining method for efficiently retrieving and extracting protein phosphorylation information from the literature. Using natural language processing (NLP) techniques, the method transforms each sentence into a dependency parse tree that precisely reflects the intrinsic relationships among phosphorylation-related keywords; detailed information on substrates, kinases and phosphorylation sites is then extracted from the tree using syntactic patterns. Compared with existing approaches, the proposed method demonstrates significantly improved performance, suggesting that it is a powerful bioinformatics approach for retrieving phosphorylation information from a large body of literature. A web server for the proposed method is freely available at http://bioinformatics.ustc.edu.cn/pptm/.
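To make the idea of pattern matching over a dependency parse tree concrete, the following is a minimal sketch and not the method's actual implementation. It assumes a hand-written parse of one hypothetical sentence ("AKT phosphorylates GSK3B at Ser9"), Stanford-style relation labels (`nsubj`, `dobj`, `prep_at`), and a single active-voice pattern; the names `Dep`, `TOKENS`, `DEPS`, `TRIGGER`, `SITE` and `extract` are all illustrative assumptions. A real system would obtain the tree from a parser and would need many more patterns (passive voice, nominalizations, and so on).

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Dep(NamedTuple):
    head: int   # index of the governing token
    rel: str    # dependency relation label (assumed Stanford-style)
    child: int  # index of the dependent token


# Hypothetical, hand-written parse of "AKT phosphorylates GSK3B at Ser9".
# In practice these edges would come from a dependency parser.
TOKENS = ["AKT", "phosphorylates", "GSK3B", "at", "Ser9"]
DEPS = [
    Dep(1, "nsubj", 0),    # AKT is the subject of "phosphorylates"
    Dep(1, "dobj", 2),     # GSK3B is the direct object
    Dep(1, "prep_at", 4),  # collapsed preposition "at" pointing to Ser9
]

# Trigger words: any token starting with "phosphorylat" (illustrative only).
TRIGGER = re.compile(r"^phosphorylat", re.IGNORECASE)
# Site mentions such as Ser9, Thr-308, Y15 (illustrative only).
SITE = re.compile(r"^(Ser|Thr|Tyr|S|T|Y)-?\d+$")


def extract(tokens: List[str], deps: List[Dep]) -> List[Dict[str, Optional[str]]]:
    """Apply one active-voice pattern: subject = kinase, object = substrate,
    prepositional dependent that looks like a residue = site."""
    events = []
    for i, tok in enumerate(tokens):
        if not TRIGGER.match(tok):
            continue
        event: Dict[str, Optional[str]] = {
            "kinase": None, "substrate": None, "site": None,
        }
        for d in deps:
            if d.head != i:
                continue  # only inspect direct dependents of the trigger
            word = tokens[d.child]
            if d.rel == "nsubj":
                event["kinase"] = word
            elif d.rel == "dobj":
                event["substrate"] = word
            elif d.rel.startswith("prep_") and SITE.match(word):
                event["site"] = word
        events.append(event)
    return events


if __name__ == "__main__":
    print(extract(TOKENS, DEPS))
```

Because the pattern is anchored on the trigger word's direct dependents rather than on word order, the same rule would still apply if modifiers were inserted between the kinase, the verb and the substrate, which is the main appeal of working on the tree instead of the surface string.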
