Using active learning and semantic clustering for noise reduction in distant supervision

The use of external databases to generate training data, known as distant supervision, has become an effective way to train supervised relation extractors, but the approach inherently suffers from noise. In this paper we propose a method for reducing noise in distantly supervised training data, using a discriminative classifier and the semantic similarity between the contexts of the training examples. We describe an active learning strategy that exploits hierarchical clustering of the candidate training samples. To further improve the effectiveness of this approach, we study several methods for dimensionality reduction of the training samples. We find that semantic clustering of the training data, combined with cluster-based active learning, makes it possible to filter the training data, which facilitates the creation of a clean training set for relation extraction at a reduced manual labeling cost.
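The idea of cluster-based filtering can be illustrated with a minimal sketch. It is not the paper's implementation: the toy context vectors, average-linkage clustering on cosine distance, the choice of one medoid query per cluster, and all function names are assumptions made for illustration. The discriminative classifier and the dimensionality reduction step are left out.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def agglomerative_clusters(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(cosine_distance(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a stays valid
    return clusters

def medoid(cluster, vectors):
    """Index of the member with the smallest total distance to the others."""
    return min(cluster, key=lambda i: sum(
        cosine_distance(vectors[i], vectors[j]) for j in cluster))

def filter_with_cluster_queries(vectors, oracle, n_clusters):
    """Ask the annotator (oracle) about one sample per cluster and keep
    only the clusters whose queried sample is a true relation mention."""
    kept, queried = [], []
    for cluster in agglomerative_clusters(vectors, n_clusters):
        representative = medoid(cluster, vectors)
        queried.append(representative)
        if oracle(representative):
            kept.extend(cluster)
    return sorted(kept), queried

# Toy data (assumed): 2-d stand-ins for context embeddings of candidate
# training samples produced by distant supervision.
vectors = [
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],  # contexts that express the relation
    [0.1, 1.0], [0.0, 0.9], [0.2, 1.0],  # noisy contexts that do not
]
true_labels = [True, True, True, False, False, False]
kept, queried = filter_with_cluster_queries(
    vectors, lambda i: true_labels[i], n_clusters=2)
```

In this toy setting two manual labels are enough to filter six candidate samples. On real data clusters are rarely pure, so a practical strategy would likely need more than one query per cluster and a way to refine impure clusters.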
