Improving Few Occurrence Feature Performance in Distant Supervision for Relation Extraction

Distant supervision is an active topic in relation extraction research. Instead of relying on annotated text, distant supervision uses a knowledge base as the source of supervision. For each pair of entities that appears in a relation of the knowledge base, this approach finds all sentences containing those entities in a large unlabeled corpus and extracts textual features to train a relation classifier. The automatic labeling provides a large amount of data, but the data have a serious problem: most features appear only a few times in the training data, and such sparse evidence makes these features very susceptible to noise, which leads to a flawed classifier. In this paper, we propose a method to improve the performance of few-occurrence features in distant supervision for relation extraction. We present a novel model for calculating the similarity between a feature and an entity pair, and then adjust the entity pair's features according to this similarity. Experiments show that our method boosts the performance of relation extraction.
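
The automatic labeling step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy knowledge base, the toy corpus, the helper names, and the choice of "words between the two entities" as the textual feature. It shows only the generic distant supervision labeling step, not the similarity model proposed here.

```python
from collections import Counter

# Hypothetical toy knowledge base: (entity1, entity2) -> relation.
KB = {
    ("Obama", "Hawaii"): "born_in",
    ("Jobs", "Apple"): "founder_of",
}

# Hypothetical toy corpus of unlabeled sentences.
CORPUS = [
    "Obama was born in Hawaii",
    "Obama visited Hawaii last year",
    "Jobs founded Apple in 1976",
    "Jobs returned to Apple in 1997",
]


def between_words_feature(tokens, e1, e2):
    """Toy textual feature (an assumption): the words between the two entities."""
    i, j = tokens.index(e1), tokens.index(e2)
    lo, hi = min(i, j), max(i, j)
    return " ".join(tokens[lo + 1:hi])


def label_corpus(kb, corpus):
    """Label every sentence that mentions both entities of a KB pair
    with that pair's relation, and return (feature, relation) examples."""
    examples = []
    for sentence in corpus:
        tokens = sentence.split()
        for (e1, e2), relation in kb.items():
            if e1 in tokens and e2 in tokens:
                feature = between_words_feature(tokens, e1, e2)
                examples.append((feature, relation))
    return examples


examples = label_corpus(KB, CORPUS)
feature_counts = Counter(feature for feature, _ in examples)

for feature, relation in examples:
    print(f"{feature!r} -> {relation} (seen {feature_counts[feature]} time(s))")
```

In this toy run every feature occurs exactly once, and the feature "visited" is labeled born_in even though the sentence does not express that relation. With so few occurrences per feature, a single wrong label of this kind can dominate what the classifier learns about it.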