Committee-based Selection of Weakly Labeled Instances for Learning Relation Extraction

Manual annotation is a tedious and time-consuming process, usually needed to generate training corpora for machine learning. The distant supervision paradigm aims at generating such corpora automatically from structured data, while the active learning paradigm aims at reducing the effort needed for manual annotation. We explore active learning and distant supervision jointly, for the use case of relation extraction, to limit the amount of automatically generated data needed by increasing the quality of the annotations. The main idea behind distantly labeled corpora is that they can simplify and speed up the generation of models, e.g. for extracting relationships between entities of interest; the instances to use, however, are typically selected at random. We propose the use of query-by-committee to select instances instead. This approach is similar to the active learning paradigm, with the difference that unlabeled instances are annotated weakly rather than by human experts. Different strategies that select instances with low or high committee confidence are compared to random selection. Experiments on publicly available data sets for the detection of protein-protein interactions show a statistically significant improvement in F1 measure when adding instances with a high agreement of the committee.
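The selection step described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the agreement score (share of committee members voting for the majority label), the toy threshold classifiers, and the names `committee_agreement`, `select_instances`, and `make_threshold_classifier` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a classifier maps a feature vector to a label (0 or 1),
# and a weakly labeled instance pairs a feature vector with its distant label.
Classifier = Callable[[Sequence[float]], int]
Instance = Tuple[Sequence[float], int]


def committee_agreement(committee: List[Classifier],
                        features: Sequence[float]) -> Tuple[int, float]:
    """Return the majority label and the share of members voting for it."""
    votes = Counter(clf(features) for clf in committee)
    label, count = votes.most_common(1)[0]
    return label, count / len(committee)


def select_instances(committee: List[Classifier],
                     weak_pool: List[Instance],
                     k: int,
                     high_agreement: bool = True) -> List[Instance]:
    """Pick k weakly labeled instances by committee agreement.

    high_agreement=True keeps the instances the committee is most sure
    about; False keeps the most disputed ones (assumed to correspond to
    the low-confidence strategy).
    """
    scored = [(committee_agreement(committee, x)[1], x, y) for x, y in weak_pool]
    # Sort on the agreement score only; ties keep their pool order.
    scored.sort(key=lambda item: item[0], reverse=high_agreement)
    return [(x, y) for _, x, y in scored[:k]]


def make_threshold_classifier(threshold: float) -> Classifier:
    """Toy committee member: predicts 1 if the first feature exceeds the threshold."""
    return lambda features: int(features[0] > threshold)


committee = [make_threshold_classifier(t) for t in (0.3, 0.5, 0.7)]
weak_pool = [((0.9,), 1), ((0.4,), 1), ((0.1,), 0), ((0.6,), 0)]

high = select_instances(committee, weak_pool, k=2, high_agreement=True)
low = select_instances(committee, weak_pool, k=2, high_agreement=False)
print(high)  # [((0.9,), 1), ((0.1,), 0)]
print(low)   # [((0.4,), 1), ((0.6,), 0)]
```

In this toy setup the committee is unanimous on the instances with features 0.9 and 0.1 and split two to one on the other two, so the high-agreement strategy returns the former pair. A random baseline would draw `k` instances from `weak_pool` without consulting the committee.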
