EXPLORING A SUBGRAPH MATCHING APPROACH FOR EXTRACTING BIOLOGICAL EVENTS FROM LITERATURE

An important task in biological information extraction is to identify descriptions of biological relations and events involving genes or proteins. In this work, we propose a graph‐based approach that automatically learns rules for detecting biological events in the life science literature. The event rules are learned by identifying the key contextual dependencies in full parses of annotated text. Detection is performed by searching for subgraph isomorphism between event rules and the dependency graphs of complete sentences. Applying our approach to the Task 1 data sets of the BioNLP‐ST 2009, we achieved a 40.71% F‐score in detecting biological events across nine event types. Our 56.32% precision is comparable with that of state‐of‐the‐art systems. Because it requires neither manual intervention nor external domain‐specific resources, the approach may also be generalized to extract events from other domains where training data are available. The subgraph matching algorithm we developed is released under the new BSD license and can be downloaded from http://esmalgorithm.sourceforge.net.
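To make the detection step concrete, the following is a minimal sketch of matching an event rule against a sentence's dependency graph. It is not the released ESM algorithm: it is a plain backtracking search, and it only checks that every rule edge is preserved in the sentence graph, which is the non-induced form of subgraph matching. The function name `find_matches`, the graph encoding as `(head, label, dependent)` triples, the node-label dictionaries, and the toy sentence are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding: a directed, labelled edge is (head, dependency_label, dependent).
Edge = Tuple[str, str, str]


def find_matches(
    rule_edges: List[Edge],
    rule_labels: Dict[str, str],
    sent_edges: List[Edge],
    sent_labels: Dict[str, str],
) -> List[Dict[str, str]]:
    """Return every injective mapping from rule nodes to sentence nodes
    that preserves node labels and all labelled rule edges."""
    edge_set: Set[Edge] = set(sent_edges)
    rule_nodes = sorted(rule_labels)
    sent_nodes = sorted(sent_labels)
    results: List[Dict[str, str]] = []

    def consistent(mapping: Dict[str, str]) -> bool:
        # Every rule edge with both endpoints mapped must exist in the sentence.
        for head, label, dep in rule_edges:
            if head in mapping and dep in mapping:
                if (mapping[head], label, mapping[dep]) not in edge_set:
                    return False
        return True

    def extend(i: int, mapping: Dict[str, str]) -> None:
        if i == len(rule_nodes):
            results.append(dict(mapping))
            return
        node = rule_nodes[i]
        used = set(mapping.values())
        for cand in sent_nodes:
            # Injective mapping with matching node labels only.
            if cand in used or sent_labels[cand] != rule_labels[node]:
                continue
            mapping[node] = cand
            if consistent(mapping):
                extend(i + 1, mapping)
            del mapping[node]

    extend(0, {})
    return results


# Toy sentence (hypothetical): "IL-2 induces expression of CD25".
sent_edges = [
    ("induces", "nsubj", "IL-2"),
    ("induces", "dobj", "expression"),
    ("expression", "prep_of", "CD25"),
]
sent_labels = {
    "induces": "induce",
    "IL-2": "PROTEIN",
    "expression": "expression",
    "CD25": "PROTEIN",
}

# Toy rule (hypothetical): trigger "expression" linked to a protein via prep_of.
rule_edges = [("t", "prep_of", "p")]
rule_labels = {"t": "expression", "p": "PROTEIN"}

matches = find_matches(rule_edges, rule_labels, sent_edges, sent_labels)
print(matches)
```

In this toy case the rule matches once, binding the trigger node to "expression" and the protein node to "CD25". Subgraph isomorphism is NP-complete in general, so a practical system would add pruning beyond this exhaustive search; the small size of event rules relative to sentence graphs keeps the sketch workable for illustration.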
