Pattern Learning through Distant Supervision for Extraction of Protein-Residue Associations in the Biomedical Literature

We propose a method for automatically extracting protein-specific residues from the biomedical literature. The goal is to associate each mention of a specific amino acid with the protein of which that residue forms a part. Our method uses linguistic patterns to identify amino acid residue mentions in text. We then apply an automated graph-based method to learn syntactic and semantic patterns corresponding to protein-residue pairs mentioned in the text. On a new, automatically generated data set of high-confidence protein-residue relationship sentences, established through distant supervision, the method achieved an F-measure of 0.78. The methods presented in this work will enable improved extraction of protein functional sites from articles, ultimately supporting protein function prediction.
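
To make the idea of pattern-based residue recognition concrete, the following is a minimal sketch of how amino acid residue mentions such as "Ser-195" or "histidine 57" might be matched with a regular expression. It is an illustration only, not the patterns used in this work: the vocabulary table, the regular expression, and the function name `find_residue_mentions` are all assumptions introduced for this example, and one-letter codes, ranges, and disambiguation of ordinary words (for example "his 3") are deliberately ignored.

```python
import re

# Assumed vocabulary for this sketch: three-letter code -> full residue name.
AMINO_ACIDS = {
    "Ala": "alanine", "Arg": "arginine", "Asn": "asparagine",
    "Asp": "aspartate", "Cys": "cysteine", "Gln": "glutamine",
    "Glu": "glutamate", "Gly": "glycine", "His": "histidine",
    "Ile": "isoleucine", "Leu": "leucine", "Lys": "lysine",
    "Met": "methionine", "Phe": "phenylalanine", "Pro": "proline",
    "Ser": "serine", "Thr": "threonine", "Trp": "tryptophan",
    "Tyr": "tyrosine", "Val": "valine",
}

# Map every surface form (lowercased) back to its three-letter code.
_NORMALIZE = {}
for code, name in AMINO_ACIDS.items():
    _NORMALIZE[code.lower()] = code
    _NORMALIZE[name] = code

# Longest alternatives first so "asparagine" is tried before "asp".
_ALTERNATIVES = "|".join(sorted(_NORMALIZE, key=len, reverse=True))

# Residue name, an optional space or hyphen, then the sequence position.
RESIDUE_RE = re.compile(
    rf"\b(?P<aa>{_ALTERNATIVES})[\s-]?(?P<pos>\d+)\b",
    re.IGNORECASE,
)


def find_residue_mentions(text):
    """Return (three_letter_code, position) pairs in order of appearance."""
    mentions = []
    for match in RESIDUE_RE.finditer(text):
        code = _NORMALIZE[match.group("aa").lower()]
        mentions.append((code, int(match.group("pos"))))
    return mentions


if __name__ == "__main__":
    sentence = ("Mutation of Ser-195 and histidine 57 abolished activity, "
                "whereas Asp102 was tolerated.")
    print(find_residue_mentions(sentence))
```

A matcher of this kind only finds residue mentions; linking each mention to its protein is the separate association step that the learned graph-based patterns address.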
