Using uncertainty to link and rank evidence from biomedical literature for model curation

Abstract

Motivation: In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from the literature. Such methods must not only extract snippets of text that relate to model interactions, but also contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches to calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on the textual uncertainty conveyed by the author. Textual uncertainty is acknowledged in biomedical text mining as an attribute of text-mined interactions (events), yet it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving the identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models.

Results: We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 on the BioNLP-ST and Genia-MK corpora, respectively, a considerable improvement over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research.

Availability and implementation: The leukemia pathway model used is available in Pathway Studio, while the Ras model is available via PathwayCommons. An online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available at https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material.

Supplementary information: Supplementary data are available at Bioinformatics online.
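To make the fusion step in the Results more concrete, the following is a minimal sketch of how two uncertainty values for the same interaction might be combined in subjective logic. The abstract does not say which fusion operator is used, so the cumulative fusion operator for binomial opinions is an assumption here, as are the `Opinion` class, the `cumulative_fuse` function, the shared base rate of 0.5, and the example values. None of these are taken from the paper or its code.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Opinion:
    """Binomial subjective-logic opinion about one interaction (hypothetical type)."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5  # assumed prior, not taken from the paper

    def __post_init__(self):
        # A binomial opinion must satisfy belief + disbelief + uncertainty = 1.
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError("belief + disbelief + uncertainty must equal 1")

    def projected_probability(self) -> float:
        # Single confidence value: belief plus the prior's share of the uncertainty.
        return self.belief + self.base_rate * self.uncertainty


def cumulative_fuse(a: Opinion, b: Opinion) -> Opinion:
    """Cumulative fusion of two opinions that share the same base rate."""
    if a.base_rate != b.base_rate:
        raise ValueError("this sketch only handles a shared base rate")
    ua, ub = a.uncertainty, b.uncertainty
    if ua == 0.0 and ub == 0.0:
        # Two fully certain opinions: average them (equal-weight assumption).
        return Opinion((a.belief + b.belief) / 2,
                       (a.disbelief + b.disbelief) / 2,
                       0.0,
                       a.base_rate)
    denom = ua + ub - ua * ub
    return Opinion((a.belief * ub + b.belief * ua) / denom,
                   (a.disbelief * ub + b.disbelief * ua) / denom,
                   (ua * ub) / denom,
                   a.base_rate)


# Two made-up opinions, e.g. derived from two sentences mentioning the same event.
evidence = [Opinion(0.6, 0.1, 0.3), Opinion(0.5, 0.2, 0.3)]
fused = reduce(cumulative_fuse, evidence)
print(fused, fused.projected_probability())
```

Under this operator, agreeing evidence from independent sources lowers the fused uncertainty below that of either input, which is the behaviour one would want when ranking interactions by the amount of supporting literature. How the system actually maps extracted uncertainty cues to opinions is not described in the abstract and is left out of the sketch.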
