Gene mention normalization and interaction extraction with context models and sentence motifs

Background:The goal of text mining is to make the information conveyed in scientific publications accessible to structured search and automatic analysis. Two important subtasks of text mining are entity mention normalization - to identify biomedical objects in text - and extraction of qualified relationships between those objects. We describe a method for identifying genes and relationships between proteins.Results:We present solutions to gene mention normalization and extraction of protein-protein interactions. For the first task, we identify genes by using background knowledge on each gene, namely annotations related to function, location, disease, and so on. Our approach currently achieves an f-measure of 86.4% on the BioCreative II gene normalization data. For the extraction of protein-protein interactions, we pursue an approach that builds on classical sequence analysis: motifs derived from multiple sequence alignments. The method achieves an f-measure of 24.4% (micro-average) in the BioCreative II interaction pair subtask.Conclusion:For gene mention normalization, our approach outperforms strategies that utilize only the matching of genes names against dictionaries, without invoking further knowledge on each gene. Motifs derived from alignments of sentences are successful at identifying protein interactions in text; the approach we present in this report is fully automated and performs similarly to systems that require human intervention at one or more stages.Availability:Our methods for gene, protein, and species identification, and extraction of protein-protein are available as part of the BioCreative Meta Services (BCMS), see http://bcms.bioinfo.cnio.es/.

[1]  R. Sokal,et al.  A QUANTITATIVE APPROACH TO A PROBLEM IN CLASSIFICATION† , 1957, Evolution; International Journal of Organic Evolution.

[2]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[3]  G. Leech,et al.  Word Frequencies in Written and Spoken English: based on the British National Corpus , 2001 .

[4]  Joel D. Martin,et al.  PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine , 2003, BMC Bioinformatics.

[5]  Alfonso Valencia,et al.  The Frame-Based Module of the SUISEKI Information Extraction System , 2002, IEEE Intell. Syst..

[6]  Regina Barzilay,et al.  Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment , 2003, NAACL.

[7]  Rodrigo Lopez,et al.  Multiple sequence alignment with the Clustal series of programs , 2003, Nucleic Acids Res..

[8]  Mark R. Gilder,et al.  Extraction of protein interaction information from unstructured text using a context-free grammar , 2003, Bioinform..

[9]  Anton Yuryev,et al.  Extracting human protein interactions from MEDLINE using a full-sentence parser , 2004, Bioinform..

[10]  Martin Vingron,et al.  IntAct: an open source molecular interaction database , 2004, Nucleic Acids Res..

[11]  Alfonso Valencia,et al.  Overview of BioCreAtIvE: critical assessment of information extraction for biology , 2005, BMC Bioinformatics.

[12]  Hao Yu,et al.  Discovering patterns to extract protein-protein interactions from the literature: Part II , 2005, Bioinform..

[13]  Benchmarking natural-language parsers for biological applications using dependency graphs , 2007, BMC Bioinformatics.

[14]  G. Blobel,et al.  Karyopherin-mediated import of integral inner nuclear membrane proteins , 2006, Nature.

[15]  Fidel Ramírez,et al.  Functional evaluation of domain-domain interactions and human protein interaction networks , 2007, Bioinform..

[16]  Peer Bork,et al.  Extraction of regulatory gene/protein networks from Medline , 2006, Bioinform..

[17]  Dietrich Rebholz-Schuhmann,et al.  Collecting a Large Corpus from all of Medline , 2006, SMBM.

[18]  P. Bork,et al.  Proteome survey reveals modularity of the yeast cell machinery , 2006, Nature.

[19]  Ulf Leser,et al.  ALIBABA: PubMed as a graph , 2006, Bioinform..

[20]  Thomas Lengauer,et al.  A new measure for functional similarity of gene products based on Gene Ontology , 2006, BMC Bioinformatics.

[21]  Ralf Zimmer,et al.  RelEx - Relation extraction using dependency parse trees , 2007, Bioinform..

[22]  Carol Friedman,et al.  Combining multiple evidence for gene symbol disambiguation , 2007, BioNLP@ACL.

[23]  Pall I. Olason,et al.  A human phenome-interactome network of protein complexes implicated in genetic disorders , 2007, Nature Biotechnology.

[24]  K. Bretonnel Cohen,et al.  Rapid Pattern Development for Concept Recognition Systems: Application to Point mutations , 2007, J. Bioinform. Comput. Biol..

[25]  Jörg Hakenberg,et al.  What’s in a gene name? Automated refinement of gene name dictionaries , 2007, BioNLP@ACL.

[26]  K. Bretonnel Cohen,et al.  Manual curation is not sufficient for annotation of genomic databases , 2007, ISMB/ECCB.

[27]  K. Cohen,et al.  Overview of BioCreative II gene normalization , 2008, Genome Biology.

[28]  Richard Tzong-Han Tsai,et al.  UvA-DARE ( Digital Academic Repository ) Overview of BioCreative II gene mention recognition , 2008 .

[29]  A. Valencia,et al.  Overview of the protein-protein interaction annotation extraction task of BioCreative II , 2008, Genome Biology.

[30]  Chris Sander,et al.  Introducing meta-services for biomedical information extraction , 2008, Genome Biology.

[31]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .