Identification of research hypotheses and new knowledge from scientific literature

Background: Text mining (TM) methods have been used extensively to extract relations and events from the literature. In addition, TM techniques have been used to extract various types or dimensions of interpretative information, known as Meta-Knowledge (MK), from the context of relations and events, e.g. negation, speculation, certainty and knowledge type. However, most existing methods have focussed on the extraction of individual dimensions of MK, without investigating how they can be combined to obtain even richer contextual information. In this paper, we describe a novel, supervised method to extract new MK dimensions that encode Research Hypotheses (an author's intended knowledge gain) and New Knowledge (an author's findings). The method incorporates various features, including a combination of simple MK dimensions.

Methods: We identify previously explored dimensions and then use a random forest to combine these with linguistic features into a classification model. To facilitate evaluation of the model, we have enriched two existing corpora annotated with relations and events, i.e., a subset of the GENIA-MK corpus and the EU-ADR corpus, by adding attributes to encode whether each relation or event corresponds to Research Hypothesis or New Knowledge. In the GENIA-MK corpus, these new attributes complement simpler MK dimensions that had previously been annotated.

Results: We show that our approach is able to assign different types of MK dimensions to relations and events with a high degree of accuracy. Firstly, our method is able to improve upon the previously reported state-of-the-art performance for an existing dimension, i.e., Knowledge Type. Secondly, we also demonstrate high F1-scores in predicting the new dimensions of Research Hypothesis (GENIA: 0.914, EU-ADR: 0.802) and New Knowledge (GENIA: 0.829, EU-ADR: 0.836).

Conclusion: We have presented a novel approach for predicting New Knowledge and Research Hypothesis, which combines simple MK dimensions to achieve high F1-scores. The extraction of such information is valuable for a number of practical TM applications.
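
The idea of combining simple MK dimensions with linguistic features, and of scoring predictions by F1, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dimension names and value sets in `MK_DIMENSIONS`, the cue-word vocabulary, and the helpers `encode_event` and `f1_score` are all hypothetical and chosen only for illustration. The sketch covers just the feature-vector construction and the metric; in the paper, vectors of this kind are classified with a random forest.

```python
from typing import Dict, List

# Hypothetical value sets for simple MK dimensions (an assumption for
# illustration; not the paper's actual feature inventory).
MK_DIMENSIONS: Dict[str, List[str]] = {
    "knowledge_type": ["Investigation", "Analysis", "Observation", "Other"],
    "certainty": ["L1", "L2", "L3"],
    "polarity": ["Positive", "Negative"],
}


def encode_event(mk: Dict[str, str], cue_words: List[str],
                 vocabulary: List[str]) -> List[int]:
    """One-hot encode each MK dimension, then append cue-word indicators."""
    vector: List[int] = []
    for dimension, values in MK_DIMENSIONS.items():
        vector.extend(1 if mk.get(dimension) == value else 0 for value in values)
    # Linguistic features: 1 if a vocabulary cue word occurs near the event.
    present = {word.lower() for word in cue_words}
    vector.extend(1 if word in present else 0 for word in vocabulary)
    return vector


def f1_score(gold: List[bool], predicted: List[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    event = {"knowledge_type": "Investigation", "certainty": "L2",
             "polarity": "Positive"}
    print(encode_event(event, ["Hypothesize"], ["hypothesize", "found"]))
    print(f1_score([True, True, False, False], [True, False, True, False]))
```

Each event or relation becomes one fixed-length vector, so any standard supervised classifier could consume it; the per-dimension one-hot layout keeps the contribution of each simple MK dimension separately visible to the model.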

[1]  Sophia Ananiadou,et al.  Enriching a biomedical event corpus with meta-knowledge annotation , 2011, BMC Bioinformatics.

[2]  M. McHugh Interrater reliability: the kappa statistic , 2012, Biochemia medica.

[3]  Karin M. Verspoor,et al.  Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts , 2016, BMC Medical Informatics and Decision Making.

[4]  Eduard H. Hovy,et al.  Automated detection of discourse segment and experimental types from the text of cancer pathway results sections , 2016, Database J. Biol. Databases Curation.

[5]  Hans-Peter Kriegel,et al.  Extraction of semantic biomedical relations from text using conditional random fields , 2008, BMC Bioinformatics.

[6]  Shuigeng Zhou,et al.  A comparison study on feature selection of DNA structural properties for promoter prediction , 2012, BMC Bioinformatics.

[7]  Kathleen Marchal,et al.  Evaluation of time profile reconstruction from complex two-color microarray designs , 2008, BMC Bioinformatics.

[8]  Sophia Ananiadou,et al.  Something Old, Something New: Identifying Knowledge Source in Bio-events , 2013, Int. J. Comput. Linguistics Appl..

[9]  János Csirik,et al.  The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes , 2008, BMC Bioinformatics.

[10]  Jiawen Li,et al.  The expression of interleukin-17, interferon-gamma, and macrophage inflammatory protein-3 alpha mRNA in patients with psoriasis vulgaris. , 2004, Journal of Huazhong University of Science and Technology. Medical sciences = Hua zhong ke ji da xue xue bao. Yi xue Ying De wen ban = Huazhong keji daxue xuebao. Yixue Yingdewen ban.

[11]  Jari Björne,et al.  BioInfer: a corpus for information extraction in the biomedical domain , 2007, BMC Bioinformatics.

[12]  Claire Nédellec,et al.  Learning Language in Logic - Genic Interaction Extraction Challenge , 2005 .

[13]  Martijn J. Schuemie,et al.  Distribution of information in biomedical abstracts and full-text publications , 2004, Bioinform..

[14]  Núria Queralt-Rosinach,et al.  Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research , 2014, BMC Bioinformatics.

[15]  K. Scharffetter-Kochanek,et al.  Reduction of CD18 Promotes Expansion of Inflammatory γδ T Cells Collaborating with CD4+ T Cells in Chronic Murine Psoriasiform Dermatitis , 2013, The Journal of Immunology.

[16]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[17]  Arthur G. Palmer,et al.  Thermal Adaptation of Conformational Dynamics in Ribonuclease H , 2013, PLoS Comput. Biol..

[18]  Massimo Poesio,et al.  Negation of protein-protein interactions: analysis and extraction , 2007, ISMB/ECCB.

[19]  Wen Huang,et al.  MTML-msBayes: Approximate Bayesian comparative phylogeographic inference from multiple taxa and multiple loci with rate heterogeneity , 2011, BMC Bioinformatics.

[20]  Jun'ichi Tsujii,et al.  Corpus annotation for mining biomedical events from literature , 2008, BMC Bioinformatics.

[21]  Sophia Ananiadou,et al.  Enriching news events with meta-knowledge information , 2016, Language Resources and Evaluation.

[22]  Laura Inés Furlong,et al.  The EU-ADR corpus: Annotated drugs, diseases, targets, and their relationships , 2012, J. Biomed. Informatics.

[23]  Hong Yu,et al.  BioN∅T: A searchable database of biomedical negated sentences , 2011, BMC Bioinformatics.

[24]  Jun'ichi Tsujii,et al.  Event Extraction with Complex Event Classification Using Rich Features , 2010, J. Bioinform. Comput. Biol..

[25]  Jun'ichi Tsujii,et al.  Feature Forest Models for Probabilistic HPSG Parsing , 2008, CL.

[26]  Sophia Ananiadou,et al.  Negated bio-events: analysis and identification , 2013, BMC Bioinformatics.

[27]  Sophia Ananiadou,et al.  Meta-Knowledge Annotation at the Event Level: Comparison between Abstracts and Full Papers , 2012, LREC 2012.

[28]  Sampo Pyysalo,et al.  brat: a Web-based Tool for NLP-Assisted Text Annotation , 2012, EACL.

[29]  Nigel Collier,et al.  Zone Identification in Biology Articles as a Basis for Information Extraction , 2004, NLPBA/BioNLP.

[30]  Sophia Ananiadou,et al.  Developing a Robust Part-of-Speech Tagger for Biomedical Text , 2005, Panhellenic Conference on Informatics.

[31]  Halil Kilicoglu,et al.  Biological event composition , 2012, BMC Bioinformatics.

[32]  Bin Li,et al.  Protein docking prediction using predicted protein-protein interface , 2012, BMC Bioinformatics.

[33]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[34]  Rebecca Ferguson,et al.  XIP Dashboard: visual analytics from automated rhetorical parsing of scientific metadiscourse , 2013 .

[35]  Àlex Bravo,et al.  Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research , 2014 .

[36]  Martin Hofmann-Apitius,et al.  ‘HypothesisFinder:’ A Strategy for the Detection of Speculative Statements in Scientific Text , 2013, PLoS Comput. Biol..

[37]  Ted Briscoe,et al.  Weakly Supervised Learning for Hedge Classification in Scientific Literature , 2007, ACL.

[38]  Dietrich Rebholz-Schuhmann,et al.  Using argumentation to extract key sentences from biomedical abstracts , 2007, Int. J. Medical Informatics.

[39]  Sophia Ananiadou,et al.  Extracting semantically enriched events from biomedical literature , 2012, BMC Bioinformatics.

[40]  Sophia Ananiadou,et al.  Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature , 2011, BMC Bioinformatics.

[41]  Jari Björne,et al.  University of Turku in the BioNLP'11 Shared Task , 2012, BMC Bioinformatics.

[42]  Jean Carletta,et al.  An annotation scheme for discourse-level argumentation in research articles , 1999, EACL.

[43]  Sophia Ananiadou,et al.  Using uncertainty to link and rank evidence from biomedical literature for model curation , 2017, Bioinform..

[44]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[45]  Dietrich Rebholz-Schuhmann,et al.  Automatic recognition of conceptualization zones in scientific articles and two life science applications , 2012, Bioinform..