HClaimE: A tool for identifying health claims in health news headlines

Abstract This study tackles the problem of extracting health claims from health research news headlines so that their veracity can be checked. A health claim can be formally defined as a triplet consisting of an independent variable (IV – namely, what is being manipulated), a dependent variable (DV – namely, what is being measured), and the relation between the two. In this study, we develop HClaimE, an information extraction tool for identifying health claims in news headlines. Unlike existing open information extraction (OpenIE) systems, which rely on verbs as relation indicators, HClaimE focuses on finding relations between nouns and draws on the linguistic characteristics of news headlines. HClaimE uses a Naive Bayes classifier that combines syntactic and lexical features to identify IV and DV nouns, and recognizes relations between IV and DV through a rule-based method. We conducted an evaluation on a set of health news headlines from ScienceDaily.com, and the results show that HClaimE outperforms current OpenIE systems: the F-measure for identifying headlines without health claims is 0.60, and that for extracting IV-relation-DV triplets is 0.69. Our study shows that nouns can provide more clues than verbs for identifying health claims in news headlines. It also shows that dependency relations and bag-of-words features can distinguish IV-DV noun pairs from other noun pairs. In practice, HClaimE can serve as a tool for identifying health claims in news headlines, which can then be compared against authoritative health claims to assess their veracity. Given the linguistic similarity between health claims and other causal claims, e.g., impacts of pollution on the environment, HClaimE may also be applicable to extracting claims in other domains.
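As a rough illustration of the ideas summarized above, the following Python sketch represents a health claim as an IV-relation-DV triplet and trains a very small Naive Bayes classifier to separate IV-DV noun pairs from other noun pairs. It is not HClaimE's implementation: the abstract does not specify the event model, the smoothing, or the feature encoding, so the multinomial model with add-one smoothing, the `dep:`/`bow:` feature strings, the class and variable names, and the toy training data are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthClaim:
    """A health claim as a triplet: independent variable, relation, dependent variable."""
    iv: str
    relation: str
    dv: str


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over string features.

    Assumption: the abstract only says "Naive Bayes"; this variant is illustrative.
    """

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        # examples: iterable of (list_of_feature_strings, label)
        for features, label in examples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocab.update(features)
        return self

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features:
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: each noun pair is described by dependency-relation
# features ("dep:...") and bag-of-words features ("bow:...").
TOY_TRAIN = [
    (["dep:nsubj", "dep:obj", "bow:reduces"], "iv_dv"),
    (["dep:nsubj", "dep:obj", "bow:linked"], "iv_dv"),
    (["dep:compound", "bow:study"], "other"),
    (["dep:nmod", "bow:researchers"], "other"),
]

clf = TinyNaiveBayes().fit(TOY_TRAIN)
label = clf.predict(["dep:nsubj", "dep:obj", "bow:reduces"])

# Made-up headline "Coffee reduces diabetes risk", hand-annotated for illustration.
claim = HealthClaim(iv="coffee", relation="reduces", dv="diabetes risk")
print(label, claim)
```

In a real pipeline the feature strings would come from a dependency parser and the headline tokens, and the relation would be assigned by the rule-based step once a noun pair has been classified as IV-DV.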
