A Domain-independent Rule-based Framework for Event Extraction

We describe the design, development, and API of ODIN (Open Domain INformer), a domain-independent, rule-based event extraction (EE) framework. The proposed EE approach is: simple (most events are captured with simple lexico-syntactic patterns), powerful (the language can capture complex constructs, such as events taking other events as arguments, and regular expressions over syntactic graphs), robust (to recover from syntactic parsing errors, syntactic patterns can be freely mixed with surface, token-based patterns), and fast (the runtime environment processes 110 sentences/second in a real-world domain with a grammar of over 200 rules). We used this framework to develop a grammar for the biochemical domain, which approached human performance. Our EE framework is accompanied by a web-based user interface for the rapid development of event grammars and visualization of matches. The ODIN framework and the domain-specific grammars are available as open-source code.
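To make the idea of a surface, token-based pattern concrete, the following is a minimal Python sketch. It is not ODIN's rule language or API: the function `match_surface`, the token fields (`word`, `lemma`, `entity`), and the example sentence are all assumptions introduced here for illustration. It only shows how a sequence of per-token constraints could be matched against a sentence without relying on a syntactic parse.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical token representation: one dict of annotations per token.
Token = Dict[str, str]
Constraint = Callable[[Token], bool]


def match_surface(tokens: List[Token], pattern: List[Constraint]) -> List[Tuple[int, int]]:
    """Return (start, end) spans where every constraint matches the
    corresponding token, scanning left to right over contiguous windows."""
    spans = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(check(tok) for check, tok in zip(pattern, window)):
            spans.append((start, start + width))
    return spans


def is_protein(tok: Token) -> bool:
    # Assumes an upstream entity tagger labelled this token.
    return tok["entity"] == "Protein"


def is_phosphorylation_verb(tok: Token) -> bool:
    return tok["lemma"] == "phosphorylate"


# Illustrative sentence with made-up annotations.
sentence = [
    {"word": "MEK", "lemma": "MEK", "entity": "Protein"},
    {"word": "phosphorylates", "lemma": "phosphorylate", "entity": "O"},
    {"word": "ERK", "lemma": "ERK", "entity": "Protein"},
    {"word": ".", "lemma": ".", "entity": "O"},
]

# Surface pattern: Protein, then a phosphorylation verb, then Protein.
PHOSPHORYLATION = [is_protein, is_phosphorylation_verb, is_protein]

matches = match_surface(sentence, PHOSPHORYLATION)
for start, end in matches:
    print(" ".join(tok["word"] for tok in sentence[start:end]))
```

Because a pattern of this kind looks only at token annotations, it still applies when a dependency parse is wrong, which is the robustness argument the abstract makes for mixing surface patterns with syntactic ones.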
