Enriching Biomedical Events with Meta-knowledge

Owing to the ever-growing deluge of information, it is becoming increasingly difficult to locate relevant information through traditional term-based search methods. Event-based text mining provides a more promising approach, as it also takes into account the semantic relationships between terms. Typical event representations focus only on identifying the type of the event, its participants and their types. However, additional information that is essential for the correct interpretation of the event is often present in the text. This includes information about the polarity, certainty level, intensity/rate/frequency, type and source of the knowledge conveyed by the event. We refer to this additional information as meta-knowledge. This thesis focusses on our work on enriching events with meta-knowledge information. In this thesis we:

• describe the annotation scheme designed specifically to capture meta-knowledge information at the event level
• report on the corpora that have been enriched through deployment of the meta-knowledge annotation scheme
• describe the work on automated identification of meta-knowledge, including:
  - a broad-ranging study on analysis and identification of polarity of bio-events using three different bio-event corpora
  - a detailed study on analysis and identification of knowledge source in bio-events found in abstracts as well as in full papers
  - a first study on analysis and identification of bio-event manner
• describe the initial work on a new approach to discourse analysis based on meta-knowledge annotations at the event level
