Semantically linking molecular entities in literature through entity relationships

Background
Text mining tools have become popular for processing the vast number of research articles available in the biomedical literature. It is crucial that such tools extract information at a level of detail sufficient to be applicable in real-life scenarios. Studies on mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to improve the integration of text mining results with database facts.

Results
We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. In the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset and by analysing the contribution of both the term detection and the relation extraction modules. We further construct a hybrid system that combines the two frameworks, and experiment with intersection and union combinations, which achieve high-precision and high-recall results, respectively. Finally, we highlight the very high performance (F-score > 90%) obtained for the specific subclass of embedded entity relations, which are essential for integrating text mining predictions with database facts.

Conclusions
The results of this study will enable us, in the near future, to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, which contains biomolecular event predictions for millions of PubMed articles, offers an exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.
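The intersection and union combinations mentioned above can be illustrated with a minimal sketch. It assumes each system outputs a set of (gene, term, relation type) triples; the `Relation` type, the `prf` helper and all toy predictions below are hypothetical and are not taken from either framework or from the shared task data. The sketch only shows why keeping relations predicted by both systems tends to raise precision, while keeping relations predicted by either system tends to raise recall.

```python
from typing import NamedTuple, Set, Tuple


class Relation(NamedTuple):
    """Hypothetical representation of one predicted entity relation."""
    gene: str
    term: str
    rel_type: str


def prf(predicted: Set[Relation], gold: Set[Relation]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) of predicted relations against gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy gold standard and system outputs (invented purely for illustration).
a = Relation("geneA", "promoter", "Protein-Component")
b = Relation("geneB", "complex", "Subunit-Complex")
c = Relation("geneC", "domain", "Protein-Component")
d = Relation("geneD", "complex", "Subunit-Complex")
x = Relation("geneX", "cells", "Protein-Component")   # false positive
y = Relation("geneY", "pathway", "Subunit-Complex")   # false positive

gold = {a, b, c, d}
system_1 = {a, b, c, x}
system_2 = {a, b, d, y}

# Intersection: keep only relations that both systems agree on.
intersection = system_1 & system_2
# Union: keep every relation predicted by at least one system.
union = system_1 | system_2

for name, predictions in [("system 1", system_1), ("system 2", system_2),
                          ("intersection", intersection), ("union", union)]:
    p, r, f = prf(predictions, gold)
    print(f"{name:12s} precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy setting the intersection discards both false positives at the cost of missing two gold relations, whereas the union recovers every gold relation but keeps both false positives. Real systems would differ in how relations are matched (for example, by text span rather than by exact triple), which this sketch does not attempt to model.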
