TrigNER: automatically optimized biomedical event trigger recognition on scientific documents

BackgroundCellular events play a central role in the understanding of biological processes and functions, providing insight on both physiological and pathogenesis mechanisms. Automatic extraction of mentions of such events from the literature represents an important contribution to the progress of the biomedical domain, allowing faster updating of existing knowledge. The identification of trigger words indicating an event is a very important step in the event extraction pipeline, since the following task(s) rely on its output. This step presents various complex and unsolved challenges, namely the selection of informative features, the representation of the textual context, and the selection of a specific event type for a trigger word given this context.ResultsWe propose TrigNER, a machine learning-based solution for biomedical event trigger recognition, which takes advantage of Conditional Random Fields (CRFs) with a high-end feature set, including linguistic-based, orthographic, morphological, local context and dependency parsing features. Additionally, a completely configurable algorithm is used to automatically optimize the feature set and training parameters for each event type. Thus, it automatically selects the features that have a positive contribution and automatically optimizes the CRF model order, n-grams sizes, vertex information and maximum hops for dependency parsing features. The final output consists of various CRF models, each one optimized to the linguistic characteristics of each event type.ConclusionsTrigNER was tested in the BioNLP 2009 shared task corpus, achieving a total F-measure of 62.7 and outperforming existing solutions on various event trigger types, namely gene expression, transcription, protein catabolism, phosphorylation and binding. The proposed solution allows researchers to easily apply complex and optimized techniques in the recognition of biomedical event triggers, making its application a simple routine task. We believe this work is an important contribution to the biomedical text mining community, contributing to improved and faster event recognition on scientific articles, and consequent hypothesis generation and knowledge discovery. This solution is freely available as open source at http://bioinformatics.ua.pt/trigner.

[1]  Lucía Hilario Pérez,et al.  Author Details , 2020, Brexit and Tourism.

[2]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[3]  Hanna M. Wallach,et al.  Conditional Random Fields: An Introduction , 2004 .

[4]  Akinori Yonezawa,et al.  Overview of Genia Event Task in BioNLP Shared Task 2011 , 2011, BioNLP@ACL.

[5]  Lorraine K. Tanabe,et al.  GENETAG: a tagged corpus for gene/protein named entity recognition , 2005, BMC Bioinformatics.

[6]  Jun'ichi Tsujii,et al.  Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles , 2007, EMNLP.

[7]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[8]  Jun'ichi Tsujii,et al.  Event Extraction with Complex Event Classification Using Rich Features , 2010, J. Bioinform. Comput. Biol..

[9]  Nigel Collier,et al.  Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications , 2004 .

[10]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[11]  José Luís Oliveira,et al.  A modular framework for biomedical concept recognition , 2013, BMC Bioinformatics.

[12]  Timothy Baldwin,et al.  Biomedical Event Annotation with CRFs and Precision Grammars , 2009, BioNLP@HLT-NAACL.

[13]  Sampo Pyysalo,et al.  Overview of BioNLP’09 Shared Task on Event Extraction , 2009, BioNLP@HLT-NAACL.

[14]  Edsger W. Dijkstra,et al.  A note on two problems in connexion with graphs , 1959, Numerische Mathematik.

[15]  Nigel Collier,et al.  Introduction to the Bio-entity Recognition Task at JNLPBA , 2004, NLPBA/BioNLP.

[16]  José Luís Oliveira,et al.  Gimli: open source and high-performance biomedical name recognition , 2013, BMC Bioinformatics.

[17]  Timothy Baldwin,et al.  Word sense disambiguation for event trigger word detection in biomedicine , 2011, BMC Bioinformatics.

[18]  Halil Kilicoglu,et al.  Syntactic Dependency Based Heuristics for Biological Event Extraction , 2009, BioNLP@HLT-NAACL.

[19]  Koldo Gojenola,et al.  Using Kybots for Extracting Events in Biomedical Texts , 2011, BioNLP@ACL.

[20]  Quang Le Minh,et al.  A Pattern Approach for Biomedical Event Annotation , 2011, Proceedings of BioNLP Shared Task 2011 Workshop.

[21]  Jari Björne,et al.  Extracting Complex Biological Events with Rich Graph-Based Feature Sets , 2009, BioNLP@HLT-NAACL.

[22]  Jari Björne,et al.  University of Turku in the BioNLP'11 Shared Task , 2012, BMC Bioinformatics.

[23]  Yijia Zhang,et al.  Biomolecular event trigger detection using neighborhood hash features. , 2013, Journal of theoretical biology.