Uncertainty Detection in Natural Language Texts

Uncertainty is an important linguistic phenomenon that is relevant in many fields of language processing. In its most general sense, it can be interpreted as a lack of information: the hearer or reader cannot be certain about some piece of information. Uncertain propositions are thus those whose truth value or reliability cannot be determined because of missing information. Distinguishing between factual (i.e. true or false) and uncertain propositions is of primary importance both in linguistics and in natural language processing applications. In information extraction, for instance, an uncertain piece of information may still be of interest to an end user, but it must not be confused with factual textual evidence (reliable information), and the two should be kept separate. The main objective of this thesis is to detect uncertainty in English and Hungarian natural language texts. In contrast to earlier studies, which focused on specific domains and were English-oriented, we offer a comprehensive approach to uncertainty detection that can be easily adapted to the specific needs of many domains and languages. In our investigations, we aim to create linguistically plausible models of uncertainty. These models are used to build manually annotated corpora, which in turn serve as the basis for implementing our uncertainty detectors for several domains with supervised machine learning techniques. Furthermore, we demonstrate that uncertainty detection can be fruitfully applied in a real-world application, namely information extraction from clinical discharge summaries.
